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ABSTRACT 


Navigation without external beacon systems has been a long-standing goal within 
the autonomous systems community. Terrain-aided navigation (TAN) presents an 
alternative to external localization but requires a prior map. In many cases, the unmanned 
vehicle (UV) is surveying areas where no map exists and therefore TAN cannot be 
employed. Active terrain-aided navigation (ATAN) incorporates reinforcement learning 
(RL) into TAN to reduce the dependence on a prior map. The spatial and 
temporal measurement uncertainties create the classical problem of 
exploration versus exploitation: the desire to explore all map areas while 
exploiting known areas to minimize position error. A dual stochastic optimal 
estimation problem models the exploration-exploitation dilemma through an 
information theoretic framework (ITF) that maximizes information gain in 
real time: a challenge with high computational complexity. The 

computational cost is reduced using an RL technique called a partially observable 
Monte Carlo planning process (POMCP). ATAN employs the ITF and 
POMCP to provide a better, more flexible coverage plan for TAN. The results of 
this thesis demonstrate the ability of a UV to navigate autonomously without 
a prior map while minimizing position error, ensuring total coverage, and 
creating an accurate map within time/energy constraints. Through ATAN, 
greater levels of autonomy are exhibited, improving upon the TAN 
framework but also providing a larger offering to the field of robotics. 
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Executive Summary 


Navigation, without the dependence on external beacon systems, has been a long-standing 
goal for the U.S. Navy. Never has it been more important. Current external beacon 
systems such as global positioning system (GPS), acoustic transponders, and the Positioning 
System for Deep Ocean Navigation (POSYDON) have significant limitations: GPS may be 
jammed or spoofed, and GPS signals do not penetrate water; undersea acoustic transponders 
have range and accuracy limitations and require manual set-up; and POSYDON has a 
combination of the GPS and acoustic transponder limitations and is dependent on the 
deployment and operation of a large array of resources covering the ocean vastness. 

One solution, known as terrain-aided (or relative) navigation (TAN), uses a terrain map 
coupled with onboard sensing systems to determine position through a correlation process. 
However, TAN requires a prior map. For autonomous systems, such as unmanned vehicles 
(UVs), frequently there is no accurate prior map since the mission purpose is to survey the 
area for the first time. 

This research seeks to improve TAN by de-emphasizing the requirement for a prior map. 
Instead the approach actively builds a map as the vehicle conducts a terrain survey. We 
call this new approach active terrain-aided navigation (ATAN). Key to the approach is 
the balancing of exploration and exploitation objectives, where exploration is the goal of 
covering the entire area with the UV sensor system (e.g., sonar or radar) and exploitation is 
the re-localization of the UV position through exploiting the best terrain features. 

The TAN approach is designed as an information theoretic framework where all components 
of the estimation and optimization software are designed to maximize information gain. 
This approach brings with it a challenge of high computational complexity due to the near 
real time optimization routine. However, the computational complexity can be reduced 
by leveraging one of the most recent techniques in Artificial Intelligence Reinforcement 
Learning called a partially observable Monte Carlo planning process (POMCP), which can 
be used to solve the dual exploration-exploitation optimization problem. 

In conclusion, ATAN gives greater levels of autonomy that allow the UV to adapt to its 
environment and permit more flexibility in a variety of mission areas. 
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CHAPTER 1: 

Introduction to Active Terrain-Aided Navigation 


The ability to navigate accurately is a well-established problem that has been made con¬ 
siderably easier by the invention and implementation of external localization methods such 
as Global Positioning System (GPS) and acoustic beacon systems. These systems have 
become increasingly integrated into the global infrastructure of today, even to a point of 
dependence. This dependence has led to a growing concern about the system vulnerabili¬ 
ties [1] and a desire for navigation solutions independent from external localization systems. 
Over the last twenty-five years, there has been significant progress in removing unmanned 
system dependence on external localization within the domain of terrain-aided (or relative) 
navigation (TAN). However, the best current techniques in TAN require an accurate prior 
map in order to estimate position. In many instances, a prior map is not available or does 
not exist, making TAN ineffective without external position localization. 

1.1 Motivation 

The goal of this thesis is to improve upon existing TAN methods using advances in machine 
learning, specifically artificial intelligence reinforcement learning, to de-emphasize the 
requirements of external localization and an a priori map. The solution demonstrates the 
ability for an unmanned system to navigate autonomously while also fulfilling the following 
objectives: minimize total positional error; completely cover a defined, bounded region 
with a designated sensor; create an accurate map to within a certain accuracy; and complete 
the task within time and energy constraints. By accomplishing this goal, the unmanned 
system will have the ability to quantify and make decisions about its environment, allowing 
the unmanned system to be more autonomous. 

1.2 Statement 

An external localization source (e.g., GPS) may not always be guaranteed and, when not 
available, results in positional uncertainty for the unmanned vehicle. For a robust system, 
the vehicle must instead conduct position estimation using terrestrial references. Mathe- 
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matically this can be represented using the following process and measurement models: 


Xk+i = /(xk,ak) + mk 
z k = M x k) + 


in which the updated state estimate x^+i and sensor measurement vector z^- utilize an vehicle 
state vector x&, a control input a^, and a terrain map h(-) : R 2 —> R while also including 
process noise co e R w and measurement noise v 6 R v . 

The position estimate and map estimate can be initialized through an external localization 
source or an arbitrary zero-zero point: the initial position estimate xo builds an initial map 
estimate mo at xq. 

Previous TAN approaches provide an underlying assumption of a prior high-confidence, 
low-error overall map estimate. This research seeks to degrade the emphasis of an accurate 
prior map by investigating a new approach to path planning through simultaneously opti¬ 
mizing the position and map non-linear functions (equation 1.1). Through careful trajectory 
planning, a balance between exploration, the desire to discover new areas, and exploitation, 
the requirement to localize position using previously explored areas, can be achieved. The 
required map accuracy, vehicle dynamics, and time (energy) add further constraints on 
the exploration-exploitation relationship to achieve the mission goal: building an accurate 
survey map. 


1.3 Methodology 

Traditional TAN serves as an alternative to external localization methods by using the 
correlation between sensor data and terrain data to minimize the positional uncertainty 
of the vehicle over a variety of terrain types and domains. For example, in the 1970s 
missile guidance systems used radar to conduct position localization through terrain contour 
matching (TERCOM) [2], [3] and, in this thesis, the autonomous underwater vehicle (AUV) 
uses sonar sensors as inputs to position estimation through active terrain-aided navigation 
(ATAN). The breadth of traditional TAN includes a large range of applications which allows 
for this research to be applied to all domains, including aerial, surface, and underwater 
vehicles. 
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The information process flow within ATAN is depicted in Figure 1.1. Initially, the vehicle 
follows a default navigational path to the target area while also conducting position estimator 
initialization. The cross-track error (CTE) controller implements the intended path by giving 
control orders to the unmanned vehicle (UV). As the UV continues through the trajectory, 
set and drift occur, taking the UV off course. An initial position estimate is made by the UV 
based on its inertial navigation system (INS) and dead-reckoning systems. Next, the UV 
periodically classifies the surrounding terrain and outputs a point cloud to the navigation 
system which conducts position estimation and updates the map estimate. 


Disturbances 


Initial 
way points 


Trajectory Planner 



Spatial Map 
Estimate 
(GPM) 


Sensor 

System 


Navigation System 


Active Terrain-Aided Navigation 


Figure 1.1. Information Process Flow for Unmanned Vehicles Using ATAN. 


Finally, the trajectory planner employs the position estimate and map estimate to generate 
and evaluate possible trajectories, selecting the best trajectory and outputting the desired 
heading rate. 

In many coverage problems, the best trajectory is determined by leveraging exploration 
and exploitation : the desire to explore all areas of the map is challenged by the obligation 
to reduce the positional uncertainty of the autonomous vehicle by exploiting known areas. 
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The autonomous vehicle must balance exploration and exploitation such that when the 
positional uncertainty is too great, the vehicle has a higher incentive to look for terrain 
advantageous to localization and when position uncertainty is low, the vehicle conducts 
exploration. This tension between exploration and exploitation creates the stochastic dual 
optimization problem that can be managed through adaptive gains of a reward function. 

The partially observable Monte Carlo planning process (POMCP), is a combination of 
a partially observable Markov Decision Process (POMDP) and Monte Carlo tree search 
(MCTS), and was theoretically proposed by Lauri and Ritala to solve the exploration- 
exploitation dilemma [4], described more completely in Chapter 7. The solution of a 
POMDP balances exploration and exploitation through at least one optimal policy, n*, 
which gives a solution over a finite horizon, with the computational complexity increasing 
as the number of finite horizons increase [4]. To counteract the increase of computational 
complexity, the POMDP can be simplified with recent machine learning techniques used in 
AlphaGo [5] which uses a MCTS over all POMDP trajectories and leads to a computationally 
manageable POMCP [6], [7]. Selective searching the POMCP through biasing [6], [7] 
allows for the determination of the "best" solution within a limited amount of time. 

Through ATAN, the dual optimization problem between exploration and exploitation effi¬ 
ciently, and effectively, determines the best trajectory within the allowable computational 
complexity, permitting real-time application and greater levels of autonomy. 

1.4 Active Terrain-Aided Navigation 

An information-theoretic framework—ATAN—was created by combining TAN and the 
POMCP in order to provide a solution to the dual optimization problem between exploration 
and exploitation within the TAN environment. The key enhancement of ATAN stresses 
autonomy, interpreting sensor information and quantifying the environment into a map. By 
quantifying exploration and exploitation , the vehicle can decide its own trajectory, making 
the unmanned system autonomous. ATAN is comprised of four elements: 

• Position Estimator (particle filter (PF)) Sec. 3.2 

• Map Estimator (Gaussian Process Model (GPM)) Sec. 4.1 

• Terrain Evaluator (Maximum a posteriori (MAP)) Sec. 5.4 

• Trajectory Planner (POMCP) Sec. 7.4. 
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Using artificial intelligence reinforcement learning, ATAN gives the possibility of improving 
TAN by quantifying the environment with a spatial estimator in a Gaussian Process Model 
(GPM) for exploration which can be leveraged against a temporal Maximum A Posteriori 
(MAP) for exploitation. Using the GPM and MAP as inputs to a reward function, a trajectory 
planner (POMCP) calculates current and future states to determine the optimal trajectory, 
effectively and efficiently mapping an area. Together, the four components of ATAN have 
the potential to create an accurate map with no a priori information and enables the AUV 
to dynamically plan and execute a trajectory while balancing exploration and exploitation 
aspects. 

Put simply, ATAN is TAN with the addition of artificial intelligence reinforcement learning: 
a real-time coverage planner that eliminates the requirement of an accurate a priori map. 

1.5 Applications of Active Terrain-Aided Navigation 

Autonomous systems can improve numerous operations, both militarily and non-militarily, 
by terminating the dependence of external beacon systems on localization. Possible missions 
in the military include but are not restricted to the following: 

• map an area prior to an amphibious operation to determine, and subsequently avoid, 
obstacles 

• conduct searches over a large areas 

• perform mine or explosive ordinance countermeasure operations 

• survey new or inaccessible geographic terrain 

• carry out Intelligence, Surveillance, and Reconnaissance (ISR). 

In non-military operations, the unmanned systems can also be used for a variety of tasks: 

• monitor habitats, count, or observe aquatic or land animals [8], [9], [10] 

• assess coral reefs [11] or plant height [12] 

• conduct Emergency Response [13] 

• perform security operations [14] 

• expedite search and rescue operations [15]. 

Regardless of the operation, ATAN is a versatile navigation approach that can improve 
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current methods, provide better results, and improve unmanned system autonomy. 


1.6 Autonomous Underwater Vehicle System Description 

This thesis demonstrates ATAN in an underwater environment though a MATLAB sim¬ 
ulation based on a modified Remote Environmental Monitoring UnitS (REMUS) 100. 
The modified REMUS 100 (Figure 1.2) operated by Center for Autonomous Vehicle Re¬ 
search (CAVR) at the Naval Postgraduate School (NPS) can support a wide variety of 
mission sets, including hydro-graphic surveys, environmental monitoring, and mapping. 



Figure 1.2. REMUS 100 Vehicle. 


The system description and implementation, shown in Table 1.1, was modeled with param¬ 
eters shown in italics. The REMUS also has a GPS antenna and the ability to use acoustic 
transponders of the Long BaseLine (LBL) and Ultra Short BaseLine (USBL) systems. 


6 



Table 1.1. REMUS Configuration and Model Parameters. 


Part/Module 

Used Setup/Utilized Implementation 

Robot 

REMUS 100. Length 2.28m. Diameter 0.19m. 
Weight 40kg. Speed 0.25-2.57 m/s. Depth Range 
3m-100m. Battery Duration 22 hours. [16], [17] Sim¬ 
ulated Speed 1.8 m/s. 

Propulsion 

Direct drive DC brush-less motor directly connected 
to open three bladed propeller [16], [17]. 

Locomotion Con¬ 
trol 

2 coupled yaw and pitch fins. Cross Tunnel 

Thrusters [17]. Simulated Heading Angular Rates 

1-2 -- -- 0 - - 21 % 

Inertial Naviga¬ 
tion System (INS) 

SeaDeViL INS navigational uncertainty equal to 0.5 
percent distance traveled [16], [17]. Simulated uncer¬ 
tainty equal to 0.7percent distance traveled. 

Acoustic Doppler 

Current Profiler 

(ADCP) 

900 kHz, four downward-looking, four upward- 
looking transponders measure forward and side ve¬ 
locity of vehicle [16], [17]. 

Sonar 

BlueView MB2250-N downward-looking, ultra-high 
resolution sonar that operates 256 beams 0.18m apart 
providing 0.6cm resolution over a 40° field of view 
[18]. Simulated Sonar as 10m circular patch. 

Computational 

Resources 

(2) intel Core i9 processors for the secondary con¬ 
troller and sensor processing. (2) Hydroid CPUs and 

hard drives for the main controller and the side scan 

sonar. 

Map 

3D grid map with grid size lmxlm. 
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1.7 Structure 

The remainder of this thesis is organized as follows. Chapter 2 discusses the most recent 
research and methodologies for underwater mapping. Chapter 3 examines different types of 
position estimation techniques, specifically Kalman filter (KF) and particle filter (PF), and 
shows the ATAN implementation of the PF. Chapter 4 and 5 define and derive exploration 
and exploitation as a Gaussian Process Model (GPM) and Maximum a posteriori (MAP) 
estimate, respectively. Exploration and exploitation are then used to define the reward 
function, using adaptive gains, in Chapter 6, presenting the stochastic dual optimization 
equation. In Chapter 7, the methodology of Markov Decision Processes (MDPs), partially 
observable Markov Decision Processes (POMDPs), and Monte Carlo tree search (MCTS) 
are reviewed and then combined to form the partially observable Monte Carlo planning 
process (POMCP). Simulation results from MATLAB are presented and analyzed in 
Chapter 8. The thesis concludes with Chapter 9 which discusses future work. 
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CHAPTER 2: 
Background 


This chapter provides background and a literature review for Active Terrain Aided Naviga¬ 
tion (ATAN). The relevant literature includes several related fields but emphasizes robotic 
autonomy. Subcategories include position estimation, geophysical navigation techniques, 
coverage planning algorithms, information theory, and artificial intelligence reinforcement 
learning techniques for trajectory planning. 

2.1 Position Estimation 

The accuracy of the map is entirely dependent on the position estimate of the unmanned 
vehicle (UV) carrying the sensor system. In most instances, positional errors are nonlinear 
and the UV requires the capability of position estimation through either external local¬ 
ization sources or through an internal correlation of sensor information and the terrain. 
However, many of the methods utilize external localization sources which have operational 
vulnerabilities for both civilian and military missions. 

2.1.1 GPS Vulnerability 

GPS utili z es a constellation of satellites which broadcast a microwave signal used by a 
GPS receiver to localize the position of the UV. While radically changing the world with 
respect to all aspects of economic, military, and recreational tasks, GPS also possesses 
vulnerabilities and limitations. 

Jamming and spoofing are two inherent vulnerabilities within GPS. Jamming can be used 
defensively or offensively by transmitting an additional noise signal across one or more 
GPS frequencies, raising the overall noise level at the GPS receiver which can generate 
more error and cause an overload at the GPS receiver or a loss of GPS lock. In Russia, GPS 
jammers are placed in key areas, in accordance with its defense initiative Pole-21, to protect 
its domestic infrastructure in the event of a war [19]. In an offensive sense, GPS jamming 
was reportedly used by China to jam GPS on U.S. drones while the drones conducted 
reconnaissance during territorial disputes in the Spratly Islands [19]. While GPS jamming 
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will affect the ability of the UV to conduct localization, spoofing may be more dangerous. 
Spoofing transmits a false GPS signal indistinguishable from the real, correct GPS signal 
to either provide a false position or in a more extreme situation, take control of the UV. 
In 2016, two U.S. Navy ships were suspected of having their GPS systems tampered with, 
causing them to violate international waters. In 2011, Iran used spoofing to capture a U.S. 
drone by interfering with its GPS and having the drone land in Iran when the drone GPS 
read a position that was in Afghanistan [20]. Both jamming and spoofing exploit significant 
vulnerabilities to GPS which can lead to the failure of a mission, damage or destruction of 
vital equipment, or loss of human life. 

For AUVs which predominately operate underwater, GPS has another glaring vulnerability: 
GPS signals cannot penetrate water. Once the AUV is underwater, the GPS signal can no 
longer be used for localization, making the AUV reliant on onboard sensors. For operations 
in deep water, or for long duration operations, the AUV must re-surface to obtain a fix 
which consumes valuable time, energy, and resources while also risking a possible counter 
detection event or collision in shallow water high contact density environments. 

The vulnerabilities of GPS are well-known, leading to a desire of reducing the dependence 
of position localization on GPS [21]. 

2.1.2 Acoustic Undersea Beacon Systems 

In the undersea domain, acoustic beacon systems are a viable method for position estimation 
that can compliment GPS. With an acoustic beacon system, the AUV triangulates position 
through the deliberate, pre-planned placement of fixed undersea acoustic beacons. Common 
systems used today are Long Baseline (LBL), Ultra Short Baseline (USBL), and Short 
Baseline (SBL), or some combination of the three [22]. Essentially, the different types of 
beacon systems correspond to the different range and geometry between the beacon system 
and the AUV. Regardless of the system, a beacon can operate as one of the following: 

• Transponder: beacon receives predetermined interrogation pulse and transmits an 
acoustic ping 

• Responder: beacon transmits a predetermined pulse after its electrically triggered 

• Pinger: beacon transmits acoustic signals continually on a specific frequency and 
pulse length. 
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However, beacons have limitations. In both a monetary and temporal sense, beacons have 
significant costs while providing limited accuracy. The ocean is a corrosive environment, 
and time needs to be devoted to ensure that beacons are properly maintained, calibrated, and 
operated. Further, beacons offer limited exactness within a finite range—±10m over 2000 
meters for LBL and ±lm over 1500 meters for USBL [16]—and must be placed prior to 
use, which may be inconvenient or impracticable in some places like conducting under-ice 
operations. Additionally, some military operations may require a covert posture or execution 
within a limited time. Using active acoustic beacons can lead to the AUV being counter 
detected or exceeding time constraints through placing and retrieving beacons, minimizing 
operational flexibility. 

2.1.3 Positioning System for Deep Ocean Navigation (POSYDON) 

Defense Advanced Research Projects Agency (DARPA) is making a hybrid system between 
GPS localization and beacon localization called Positioning System for Deep Ocean Navi¬ 
gation (POSYDON) [23]. A surface buoy, using GPS to localize itself, can act as a beacon 
system to transmit underwater acoustic signals. This coordinated buoy system attempts to 
imitate actual GPS using multiple small long-range acoustic sources (satellites) to allow 
accurate positioning via acoustic signals (microwaves) without the AUV surfacing for a GPS 
fix. While Phase I of a three-phase project is underway, there are still multiple sources of 
error and vulnerabilities from using GPS and the acoustic beacons, as discussed in Sections 
2 . 1 . 1 - 2 . 1 . 2 . 

2.2 Geophysical Navigation 

Autonomous vehicles require a methodology to estimate position. Two geophysical navi¬ 
gation techniques—TAN and simultaneous localization and mapping (SLAM)—minimize 
the emphasis on external beacon systems by using terrain or physical features to produce a 
position estimate. Both TAN and SLAM over the last couple of years have made signifi¬ 
cant progress in navigation independent from external beacon systems and provide notable 
contributions to ATAN. 
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2.2.1 Terrain Aided (Relative) Navigation (TAN) 

TAN is a substantial building block with multiple attributes contributing to ATAN. TAN 
reduces positional uncertainty of an unmanned vehicle by correlating sensor data and a 
known terrain map, as shown in Figure 2.1. 



Figure 2.1. Correlation of Observation (left) and Terrain (right) in One 
Dimension. Source: [24], 


The highest correlation between the sensor data and the known map, shown in Figure 2.1 
(right), is the position estimate. In the underwater domain, TAN has become a popular 
choice of position estimation and navigation, as GPS is unable to penetrate water at depth. 

In 2004, Nygren and Jansson introduced the Correlator (Correlation) Method—a recursive 
position estimation by correlating multi-beam sensors and terrain information—to use 
on unmanned aerial vehicles (UAVs), AUVs, and submarines [25]. The following year, 
Nygren utilized a submersible vehicle that showcased significant improvement of position 
estimation over a 65 km test track, yielding root mean squared (RMS) error less than one 
meter for simulation and a few meters for the sea trial [26]. Two main outcomes of Nygren’s 
dissertation were well known sensor beam placements converged to a Gaussian distribution 
when the number of sensor beams approached infinity, and that the accuracy of the position 
estimation can be judged using the lower bound on the variance, Cramer-Rao Lower Bound 
(CRLB) [26], 
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At Stanford, Rock and Kimball discussed and implemented Terrain Relative Navigation 
(TRN) to combat issues listed previously in this chapter. Specifically, Rock and Kimball 
presented TRN to navigate [27] and localize [28] the AUV position underneath icebergs. 
Since icebergs are constantly moving (either linearly or rotationally), the AUV can conduct 
relative localization using a particle filter and the correlation between a a priori map and 
its sensor measurements [28]. 

Gao, Peng, Zhou, Wang, and Xu proposed an improved matching algorithm (Terrain Match¬ 
ing Localization with Gradient Fitting) to minimize computational complexity and raise 
positional accuracy of the position estimator [29]. By partitioning the area into smaller 
sections, the gradient of each section can be fitted to a normal distribution and matched 
to the vehicle observations, reducing the number of comparisons required and reducing 
interference of repetitive terrain [29]. 

One of the main, well documented difficulties within TAN has been localizing over fea¬ 
tureless terrain [26], [30], [31]. Nygren proposed increasing the sensor beam count which 
gives the ability to calculate smaller terrain gradients through interpolation [26], [32]. 
Later, Meduna, Ewan, and Rock showed that interpolation leads to higher computational 
complexity and possible over-fitting for minimal performance gains [32], [33] and a more 
robust solution was presented by Dektor and Rock by exponentially weighting PF outputs 
( p a where 0 < a < \ ) based on map error, sensor error, and terrain information [30]. 
In subsequent work, Dektor and Rock built on the exponentially weighting PF outputs by 
examining reasons and proposing a methodology for detecting false convergence of the PF, 
which successfully allowed an AUV to localize while driving over flat terrains [31]. 

2.2.2 Simultaneous Localization and Mapping (SLAM) 

SEAM has been largely viewed as the precursor for true autonomy [34]—concurrently 
conducting AUV position localization with progressive map building [35]—and consists of 
two components, front end and back end, shown in Figure 2.2. 
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Figure 2.2. SLAM Front End and Back End Diagram. Source: [36]. 


The front end extracts topological features from sensor data and assigns measurements to 
each feature while the back end calculates a map estimate [36]. One of the more important 
aspects of SLAM is data association, or more specifically loop closure. As the AUV drives 
over a previously explored area, the current loop closes, anchoring the map estimate and 
reducing the uncertainties in robot and feature estimates [34]. Without loop closure, the 
SLAM problem exhibits quadratic growth [34], for each new feature encountered could lead 
to failure of the problem if the loop is not closed in a reasonable time. The map estimate 
registers all landmarks and obstacles discovered by the vehicle, but further performs an 
optimization to minimize positional uncertainty of the vehicle and the features [34]. 

From a theoretical and conceptual perspective, the SLAM problem is considered solved, but 
from an implementation perspective SLAM still has difficulties in relation to computational 
complexity (proportional to the square of the number of landmarks) [35] and the convergence 
of the algorithm [37]. Further, SLAM is a relativity model using the relative joint state 
model of the robot pose and its relative locations to the observed landmarks [37]. The 
process model affects the vehicle pose states, and the observation model only affects the 
vehicle-landmark pair [37]. While the optimal algorithms aim to reduce the computational 
complexity of the problem, the solutions still result in estimates and covariances that are 
equal to those without the reduction [37]. 
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Active SLAM 

Active SLAM research utilizes the SLAM framework, but adds an exploration versus 
exploitation decision making dynamic [36]. Two popular models for decision making, 
information theoretic and POMDP [36], are discussed further in Sections 2.5 and 2.6. The 
optimal decision, within time and computational budget constraints, maximizes the area 
explored while also minimizing the positional uncertainty [38]. 

2.2.3 Geophysical Navigation in ATAN 

ATAN draws techniques from both the TAN and the SLAM domains. For the purposes of 
the thesis, TAN provides the underlying structure of ATAN since TAN involves a complete 
estimate of the terrain associated with the UV coverage mission and permits a correlation 
estimate at any position. Conversely, SLAM techniques involve a data association task for 
determining the correlation by a sensor measurement and a finite number of landmarks. 
Overall the differences between TAN and SLAM are minor and future work (not covered 
in this thesis) will include the development of an Active SLAM component incorporating 
the techniques used in this thesis. 

Within TAN, the correlation method was adopted for position estimation (Chapter 3) and 
the ability to model terrain using a Gaussian distribution was used for the exploration 
component (Chapter 4). The position estimation can be further improved using methods 
in [31], discussed in future work not covered by this thesis. 

The underlying fundamentals of SLAM relate to the approach of this thesis: feature extrac¬ 
tion, data association, and map generation. However, instead of feature classification, each 
lm x lm cell is quantified through the unevenness of the terrain. The position estimator 
compares the current AUV observation with projected position observations to determine 
the best position. The map is generated through the uncertainty of the AUV position and 
AUV sensors using a model discussed in Section 2.4. While SLAM attempts to do all 
calculations concurrently, this thesis intentionally separates the process, preferring a dual 
optimization problem with exploration and exploitation. 

Applicable methods to ATAN used within Active SLAM include path planning, informa¬ 
tion theoretic, and reinforcement learning. These concepts are universal between Active 
SLAM and ATAN, as both SLAM and TAN share the goal of maximizing exploration 
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while minimizing vehicle error. The difference between ATAN and SLAM is that ATAN 
intentionally breaks the optimization apart and instead performs a dual optimization be¬ 
tween exploration and exploitation before re-combining while SLAM performs a single 
optimization. As a result, exploration and exploitation can be modeled independently and 
then brought together with a reward function (Chapter 6): a mathematical framework that 
prioritizes either exploration (Chapter 4) or exploitation (Chapter 5) depending on the state 
of the UV. 


2.3 Path Planning 

Path planning has been around since at least 1991 with the goal of collision-free trajectories 
that have the robot travel from an initial state to the goal state in the fastest time [39]. 
Decision-theoretic planning was introduced around 2006 [40], to select the best path for 
the goal state. Recent techniques in path planning utilize either Gaussian Processes (GPs) 
or maximum Likelihood estimates (MLEs) which was used prior to implementing and 
designing ATAN. 

2.3.1 Gaussian Process (GP) 

GPs can help classify value functions within the path planning environment. Pragr, Cizek, 
Bayer, and Faigl use a GP to quantify terrain traversal cost (TTC) [41] for a multi-legged 
walking robot. The robot explores by using the GP predictive variance to determine whether 
to discover new areas ( exploration ) or decrease uncertainty within the TTC ( exploitation ) 
[41]. The key takeaway from Pragr et al. is the combination of Gaussian Processes using a 
Robust Bayesian Committee Machine (RBCM) which reduces the computational complexity 
from O(A 3 ) to approximately 0(A) [41]. In another instance of using a GP, Nardi and 
Stachniss use a GP mixture model coupled with a top-down aerial picture to reach a desired 
end state resulting in improved navigation [42]. The GP mixture model inputs experiences 
from the vehicle to determine the proper action between exploring new areas and exploiting 
the environment by circumventing obstacles or difficult terrain [42]. Lastly, Rui and Chitre 
utilize a path planning algorithm to reach a desired end state by using a prior bathymetry 
map combined with a GPM to minimize the AUV uncertainty ( exploitation ) over the path 
length [43]. 
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2.3.2 Maximum Likelihood Estimate (MLE) 

The MLE determines the most probable outcome using a probabilistic model of observed 
data, which can be useful to exploit partially observable environments for path planning. 
Zambom, Seguin, and Zhao use a MLE to determine future positions of moving obstacles to 
determine the shortest path [44]. Ornik and Topcu propose a MLE path planning algorithm 
within a previously unknown time varying environment [45]. Using only previously ob¬ 
served information and known system dynamics, the MLE can either improve the accuracy 
of the MLE by exploring new areas or use current estimates of the environment ( exploitation ) 
to progress closer to the objective [45]. Zhang, Hu, Chadha, and Singh propose a Maximum 
Likelihood Path Planning algorithm that determines the maximum likelihood to reach the 
goal (or end of trajectory) and avoid obstacles, exploiting the environment within sensor 
range and probabilistically determining the environment beyond the sensor range [46]. By 
using a limited trajectory, the computational complexity is lowered to a manageable level, 
allowing the robot to quickly make decisions regarding path planning [46]. 

2.3.3 Path Planning in ATAN 

Path planning is an important part of ATAN, as the correct choice between exploration and 
exploitation must be considered and evaluated during the path planning portion. 

The exploration component (Chapter 4) must be able to measure the amount of exploration 
over all possible trajectories, including unexplored areas, areas with large uncertainty, and 
areas perfectly known. The quantification of exploration lends itself to a GPM as a MLE 
cannot accurately estimate unknown areas. 

Conversely, the exploitation component (Chapter 5) uses a MLE, as areas previously ex¬ 
plored can be measured through terrain classification and positional uncertainty of the 
vehicle when the location was measured. This thesis utilizes a similar approach to the 
Zhang et al. Maximum Likelihood Path Planning algorithm with slight modifications. 

2.4 Coverage Planning 

Coverage has many similarities with path planning, as both require an understanding of 
exploration and exploitation. However, the key difference is that while path planning gets 
from a starting location to an ending location, coverage planning must pass over all points 
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of interest within the map and is evaluated by the quality of the finished map. There are 
two types of coverage planning: inspection planning and full coverage planning. 


2.4.1 Inspection Planning 

Inspection planning, a subset of coverage planning, uses robots to examine certain features 
of an overall map with the goal of minimizing trajectory length and maximizing the number 
of features visited [47]. Essentially, the exploration component attempts to maximize the 
number of features visited and the exploitation component attempts to maximize informa¬ 
tion collected at each feature, which is made difficult since the computational complexity of 
inspection planning grows exponentially with the number of features [47]. Fu, Kuntz, Salz- 
man, and Alterovitz propose an inspection planning method called Incremental Random 
Inspection-roadmap Search (IRIS) [47], which provides a structure that efficiently computes 
asymptotically-optimal inspection plans [47]. Another inspection planning algorithm, er- 
godic coverage directs and operates robots to certain areas for specified times based on the 
probability of locating a feature within that search space [48]. Furthermore, additions to er- 
godic coverage given by Ayvali, Salman, and Choset can be applied to non-linear problems, 
incorporating sensor areas and obstacle avoidance [48]. 

2.4.2 Full Coverage Planning 

Full coverage planning covers every area of the map with a sensor. The oldest approach 
is the Boustrophedon method, synonymous with the back and forth motion of mowing the 
lawn [40]. The issue with the Boustrophedon method is that while exploration is maximized, 
exploitation does not exist as areas are not covered multiple times, which translates to a 
growing positional uncertainty. 

One of the earlier applications of full coverage planning was frontier-based exploration 
which uses an occupancy grid to track previously explored areas [4], [49]. The intersection 
between the known and unknown areas, called frontiers, were visited until there were no 
frontiers remaining, an equivalent way to express complete coverage [49]. 
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2.4.3 Coverage Planning in ATAN 

ATAN utili z es exploration tactics from [4], as further discussed in Section 2.5. However, 
ATAN can be improved further with inspection planning to specifically target unexplored 
areas when certain thresholds of coverage are met, which is left for future work. 

2.5 Information Theory 

Claude Shannon started the Information Theory field in 1948 to "measure information, 
choice, and uncertainty" [50]. The field of Information Theory has evolved significantly 
since [51], [52], encompassing an expansive field that includes probability theory, entropy, 
GPs, and mutual information while maintaining a common theme: the quantification of 
information. 

In most robotic applications, correctly measuring the change of uncertainty directly con¬ 
tributes to the success of the the robot making correct decisions, particularly within the 
exploration and exploitation domains. 

One popular method for quantifying the exploration-exploitation dilemma is through the 
uncertainty measurement of Rao-Blackwellized particle filters. Stachniss, Grisetti, and 
Burgard use the change of entropy within a Rao-Blackwellized particle filter to measure the 
information gain of an action [53]. By combining multiple actions together, the exploration- 
exploitation dynamic can be resolved by the greatest reduction of uncertainty within the 
posterior [53]. Carlone, Du, Ng, Bona, and Indri also investigate the uncertainty within Rao- 
Blackwellized particle filters, but instead use the Kullback Leibler Divergence (KLD) to 
determine the best expected information gain of a policy which is a function of information 
gain ( exploration ) and loop closure ( exploitation ) [54]. The advantage of quantifying 
exploration further allows for the re-visitation of previously explored areas to lower the 
uncertainty of the higher uncertainty areas [54]. 

Another method to quantify exploration was proposed by Lauri and Ritala to model obser¬ 
vations, z = f(s,a), probabilistically as a function of state s and action a, which can then 
be used to calculate the belief state b through Bayesian filtering [4]. The information gain 
of a trajectory (measure of exploration ) is a function of the belief state [4]. 

Ayvali, Salman, and Choset use ergodicity to determine exploration trajectories which result 


19 



in weighting certain areas of a map based on the probability of success [48]. Taking the 
Fourier Transform of the trajectory and the desired coverage distributions, higher weighting 
values can be given to lower frequency components which correspond to higher similarities 
between trajectory and desired coverage, leading to efficient exploration [48]. The norm of 
each distribution can then be compared using KLD, where the best trajectory will lead to 
the highest mutual information [48]. 

Rui and Chitre utilize a path planning algorithm to reach a desired end state by using a 
prior bathymetry map combined with a GPM and reinforcement learning (RL) to minimize 
localization uncertainty [43]. Using information entropy of the terrain, the GPM determines 
the estimated value function of localization [43]. 

2.5.1 Information Theory in ATAN 

ATAN conducts a dual optimization through an information-theoretic framework that prop¬ 
erly prioritizes exploration and exploitation. The exploration (Chapter 4) component is 
calculated by a spatial GPM and mutual information gain of the KLD. The position estima¬ 
tion uses a PF to represent the multi-modal, non-Gaussian, non-linear error associated with 
the vehicle. The variance of the PF and GPM are then used as inputs to the gain values of 
the reward function to prioritize exploration and exploitation (Chapter 6). 

2.6 Reinforcement Learning 

RL is subsection of machine learning where machine agents make decisions to maximize 
the total rewards over a set of actions. Common forms of RL include the POMDP, MCTS, 
or the combination of the two: POMCP. 

2.6.1 Partially Observable Markov Decision Process (POMDP) 

A POMDP is a decision process for partially observable environments that has been used 
to make decisions in real-time scenarios. Cai, Luo, Saxena, Hsu, and Lee at the National 
University of Singapore utilized the combination of a search algorithm to bias a POMDP 
to minimize the computational complexity and provide real time solutions for autonomous 
driving through a crowded environment, called LeTS-Drive [55]. Lauri and Ritala proposed 
a framework to find the best control policy to maximize Exploration utilizing an open loop 
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POMDP to efficiently and effectively explore an area using a Turtlebot [4] equipped with 
laser range finders (LRF). The Turtlebot operates within a Robot Operating System (ROS) 
environment that starts with an empty goal, empty result, and uses feedback to generate 
exploration tasks [4]. The exploration task is then generated using a planner software 
of costmaps to record open areas, obstacles, and unknown space using the probability of 
occupying the cell [4]. Based on the costmaps, the Turtlebot calculates the best trajectory 
from randomized trajectories (ten trajectories of three control actions) using a POMDP 
and executes the best trajectory using move_base [4], Once the trajectory is complete, the 
Turtlebot pauses and repeats the process [4]. 

2.6.2 Monte Carlo Tree Search (MCTS) 

MCTS is an effective technique for searching over a rich configuration space. It uses random 
sampling to search a decision space, normally structured as a tree, with the purpose of finding 
optimal decisions within the given domain [7]. MCTS is a popular decision-theoretic 
framework within Game Theory, mapping considerable branching factors of large fully 
observable decision spaces l ik e Chess, Checkers, Chinese Checkers, and AlphaGo. Liu, 
Zhou, Cao, Qu, Yeung, and Chung used a biased MCTS coupled with deep reinforcement 
learning to model Chinese Checkers [56]. The key difference between Chinese Checkers 
compared to most board games is that all pieces stay on the board, and all pieces can move 
in any direction which leads to a non-narrowing, consistent branching factor [56]. 

AlphaGo 

AlphaGo has an extremely large search space with a substantial branching factor correlating 
to an average of 250 moves per play [7]. Evaluation of multiple moves can be represented 
mathematically through the number of moves available, b ~ 250, and number of moves per 
game, d ~ 150, yielding a search space of b d [5]. AlphaGo uses two networks: policy 
and network [5]. The policy network computes the conditional probability, p{a\s), of 
all legal moves a given the board representation s [5]. The value network computes the 
expected outcome of the move using the future board s'. With b d possible moves [5], the 
computational complexity to rival human experts is difficult. 

However, by using and effectively biasing the MCTS, AlphaGo has achieved success, 
noted by a 99.8% win rate against programs and winning five games to zero against the 
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European Champion [5]. Within the MCTS, each node stores a prior probability p{s,a), a 
visit count N(s,a), and a reward Q(s,a ) to bias the MCTS during each iteration [5]. The 
bias allows the MCTS to continue choosing branches with higher values ( exploitation ) but 
also ensuring other possibly optimal policies are also searched ( exploration ), allowing the 
MCTS to quickly search an immense decision space and converge close to the optimal value 
function [5]. The MCTS is further improved using deep convolutional neural networks to 
reduce the search space (depth and breadth of the tree), with supervised and reinforcement 
learning occurring to better tailor the search [5]. 

The ability of AlphaGo to quickly and advantageously search a large decision space set 
the foundation for this thesis: incorporating RL into TAN. RL improves TAN by utilizing 
the MCTS coupled with biasing to perform real-time evaluations of future trajectories for 
exploration and exploitation, yielding efficient coverage while also minimizing UV position 
error. 

2.6.3 Partially Observable Monte Carlo Planning (POMCP) 

A POMCP is a combination of the POMDP and the MCTS, extending the MCTS to a 
partially observable environment [6] and can be used to solve finite POMDPs [4]. There 
are multiple ways to bias the POMCP, with Silver and Veness providing a biasing technique 
that allows for exploration of all branches within a node, and then selecting the branch with 
the highest reward for further searches [6]. In another approach to biasing the POMCP, 
Castellini, Chalkiadakis, and Farinelli constrain the search of the POMCP through proba¬ 
bilistic Markov random fields [57]. Bai, Wu, and Chen propose two POMCP algorithms that 
utilize cumulative reward from an MCTS search, represented as an unknown distribution 
random variable, to weight exploration and exploitation inside the MCTS [58]. 

2.6.4 Reinforcement Learning within ATAN 

ATAN utilizes key aspects of RL, specifically from AlphaGo, to aid in the determination 
of the next trajectory by implementing a POMCP as a trajectory evaluator. The POMCP 
equates to the combination of a POMDP and a MCTS. Within ATAN, the POMDP models 
the exploration and exploitation components through the first horizon consisting of multiple 
trajectories of a certain distance and the MCTS searching beyond the first horizon to a further 
secondary horizon. Each horizon exponentially increases the computational complexity of 
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the trajectory evaluator, but by using the MCTS only selected trajectories are required to be 
calculated. Further, the MCTS can be improved through biasing, using the same biasing 
techniques shown in [6]. Through the use of techniques in RL, ATAN determines the best 
trajectory within a finite time to correctly prioritize exploration and exploitation. 

2.7 Contributions of This Thesis 

This thesis responds to the Department of Defense Artificial Intelligence initiative [59] 
by increasing the decision capability of autonomous robots in both military and civilian 
applications. 

ATAN quantifies the terrain through a Boltzmann entropy representation [60], [61], but 
more specifically, our work presents the following as major contributions: 

• A real-time information-theoretic trajectory planner that adaptively leverages explo¬ 
ration and exploitation to solve for the best trajectory within a computational limit 

• A real-time exploration component that maximizes the information gain of the se¬ 
lected trajectory through a GPM 

• A real-time mapping algorithm that does not require a prior map 

• A comparative framework that allows for analysis of different approaches—POMDP, 
POMCP, and biased POMCP—within ATAN. 

Through these contributions, ATAN stresses greater levels of autonomy for UVs and im¬ 
proves upon the foundation and application of TAN. 
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CHAPTER 3: 
Position Estimator 


Position estimation is essential for the success of ATAN. Without the ability to accurately 
estimate position, the positional uncertainty of the UV increases over distance traveled, 
yielding a less accurate map. Further, an inaccurate position estimator eliminates the ability 
to evaluate and employ the best trajectory for exploration , exploitation , or a combination of 
the two. Therefore, ATAN requires a robust position estimator to ensure positional accuracy. 
This chapter introduces and designs the optimal position estimator prior to evaluating the 
performance of the optimal position estimator. 

3.1 Introduction to Optimal Position Estimation 

Two popular types of optimal position estimators—the Kalman filter (KF) and the particle 
filter (PF)—use stochastic estimation to minimize positional uncertainty. 

The KF is a linear, recursive position estimator that uses a set of prediction and update 
equations to calculate a position estimate for a UV. The prediction step starts with an a 
priori state estimate; an error covariance; a control action; and a process noise covariance 
to calculate the state estimate and the respective error covariance. The update step then uses 
the state estimate and error covariance as well as observation noise covariance to determine 
the a posteriori state estimate and error covariance. In order for the KF to correctly model 
the UV, the KF necessitates a linear system model in which system errors, vehicle position, 
and observations follow a Gaussian distribution. For non-linear models, the KF can be 
adjusted using the Extended Kalman Filter (EKF), which approximates the non-linearity by 
using Taylor series expansion to create a linear system. While the KF is useful for position 
estimation, the KF has the following limitations: the KF computational complexity, 0(N 2 ), 
poses limitations on larger data sets, where N is the number of data points [62]; the KF 
cannot use negative information and does not have the ability to see a feature that isn’t 
expected [62]; and the KF builds on previous data, which can lead to failure of the model 
if the previous data is inaccurate [62]. 

The other common position estimator, the PF, uses a probabilistic approach to conduct 
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position estimation. The PF generates possible hypotheses around a position estimate x t , 
called particles, that each contain individual state estimates. A transition model moves 
each particle in accordance with the vehicle motion model, and a measurement model 
determines the observation of each particle, Zt, using an onboard map. The measurement 
model further calculates the particle weight, w t = p(x t \z t ), through the correlation of the 
particle observation and the actual observation of the vehicle, in which the highest weights 
give better position estimates. The PF keeps the higher weighted particles, and the process 
is repeated. 

The main advantage of the PF is the ability to track multi-modal, non-linear, non-Gaussian 
probability densities [63], [64], [65], [66]. Thrun, Burgard, and Fox also provide additional 
PF advantages: the PF computational complexity, 0(A), is linear with respect to the 
number of particles, N [67]; the PF works for any high dimensional system, observation, or 
motion model [67]; the PF allows for multiple different types of combinations of probability 
distributions [67]; the PF maintains the ability to correct itself after converging towards an 
erroneous state (kidnapping); and the PF algorithm is relatively easy to implement [67]. 

Thrun, Burgard, and Fox also give the following PF limitations: the PF can suffer from a 
loss of particle diversification when samples converge [67]; the PF may fail with minimal 
noise present [67]; the PF performance is difficult to measure [67]; and the PF is is hard to 
predict [67]. 

Geophysical navigation introduces multiple error sources [33] that combine into a non¬ 
linear, non-Gaussian problem [68], [69], [70]. As the limitations of the PF can be alleviated, 
as discussed in Section 3.2, the advantages of the PF allow for a better position estimation 
and allow ATAN to better estimate a non-linear problem that does not conform to Gaussian 
distributions. 


3.2 Particle Filter 

The PF is a position estimator that can be used for complicated, nonlinear applications such 
as ATAN. Starting with multiple state hypothesises, or particles, the PF converges to the 
best hypothesis (Figure 3.1) through Bayes filtering: estimating the current state based on 
sensor measurements and independent, identically distributed (IID) data [71]. 
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Figure 3.1. Two-Dimensional PF Visual Representation with State Hypothe- 
sess During Different Time Updates. 


As seen in Figure 3.1, the particles initially start covering a large distribution, and as the 
robot moves through the environment, converge on the location of the robot. 

3.2.1 Monte Carlo Localization (MCL) 

In this thesis, a specific type of PF, the monte carlo localization (MCL), shown in Algorithm 
1, is used to locali z e the UV. The MCL consists of an Initialization, Motion Model (Line 
4), Sensor Model (Line 5), and a Resampling process (Lines 8-11) where B, is the full belief 
state , a t is the action, z. t is the observation, n is the particle, and N is the total number of 
particles. 


27 








The initialization consists of the particle generation and the initial belief state, b tQ . Using the 
initial belief state b tQ , sensor measurement Zt 0 , and action sequence a to , the initial position 
estimate x tQ can be determined through the maximum conditional probability, 

x = arg max p(b t0 \zt 0 )- (3.1) 

zeZ 

The particle generation utilizes a bootstrap method, random sampling of a data-set with re¬ 
placement, to produce a Gaussian distribution ~N(x,cr 2 ) around the position estimate. 
The output of the initialization gives N particles representing the current belief state 
(b tl ,...b tN ) G Bj and the posterior probability density of the current belief state. 

The motion model, line 4 in algorithm 1, uses the inputs of a control action a t and prior 
belief state b t _ x to output the current belief state b) by updating all particles based on 
the control action of the vehicle. For example, if the UV chooses an action of maintaining 
course and speed then all particles maintain course and speed, or if the UV turns right at a 
heading rate of two degrees per minute then all particles turn right using a heading rate of 
two degrees per minute. With each control action, the particles diverge, an expected result 
as the positional uncertainty increases as the UV moves through the trajectory. A key aspect 
of the motion model is the incorporation of the control actuator error which can lead to the 
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PF converging on the wrong solution if the control actuator error is modeled incorrectly 
or not given a large enough magnitude. Therefore, error within the motion model must 
incorporate the error of the UV movement model to ensure PF effectiveness. 

The sensor model, line 5 in algorithm 1, uses the output belief state of the motion model, 
b\ , and computes the weights, or level of importance, of each particle using the observation 
of each particle. This can be more easily described using Figure 3.2. 


Hidden State Variables: x 



Observed State Variables: z 

Figure 3.2. Particle Filter Sensor Model Visual Description. 


As the UV moves, the true state of the UV is unknown (hidden states) while the observations 
are known. For each particle, the UV takes the individual particle observation based on the 
pose of the particle, and compares the particle observation to the current UV observation. 
By moving the particles with the same control actions of the UV, the particles that correlate 
better to the observed terrain are closer to the true UV position. The correlation between 
the particle observation and the UV observation is the particle weight, or likelihood of the 
position of the UV at the particle. With the new weights, a particle with a larger correlation 
has the higher probability of being more accurate. 

Resampling, implemented in Algorithm 1 lines 8 through 11, draws with replacement, to 
make a new V-particle set with a new distribution based on the previous weighting values 
from the sensor model [67]. This changes the distribution of particles from the initial 
distribution, b(x t ), to the posterior distribution, b(x t ) = r]p(zt\x\"^)b(x t ), with particles with 
lower weighting values less likely to be included in the new distribution [67]. 

3.2.2 Particle Kidnapping 

Particle Kidnapping, or particles converging on the wrong location, is a significant problem 
with the PF. However, this issue can easily be avoided by adding a small percentage 
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of particles to each particle set which can be justified by the low probability of the PF 
converging incorrectly [67]. The number of random particles added can be directly related 
to the average measurement probability [67]: 


1 

N 


Z 


n= 1 





(3.2) 


Thrun, Burgard, and Fox show an augmented MCL, shown in Algorithm 2, that uses 
two parameters to model exponential decay rates: slow, a s i ow , and fast, af ast , in which 
0 < a s i ow a fast [67]. During the re-sampling process, the probability of generating a 
random sample can be mathematically defined through max( 0.0,1 - Wf ast /w s i ow > 0). 


Algorithm 2: Augmented MCL [67] 


1 MCL{B t -\,a t ,z t ,n) 

2 Static W s iowi m/ast b! ~ — 0, 

3 for n = 1 to N do 

4 b\"\ = motion model(a f , 

5 w\ n][ = sensor modelfo, b\' l \n); 

6 B t = B t +<b l ; ,] ,w\ n] >-, 

, 1 [«] 

7 UJavg ~ W a vg 

8 end 

9 W s / ow — W s / ow + G 'si ow (fWavg ~ Wslow ) 5 

10 Wfasl ~ fast + ® fasti^avg ~ ^ fast) 5 

11 for n = 1 to N do 

12 with probability max( 0 . 0,1 - Wf as t/u)siow ) do 

13 add random pose to B, 

14 else draw i 6 1 ,.... /Vwith probability oc w\ ; 

is add b\ to B t ; 

16 end 

17 end 

is return B t 
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3.2.3 Effective Sample Size (ESS) and Resampling 

Resampling—changing the distribution of the particles—during every iteration can lead to 
errors within the PF by introducing a loss of diversity [67]. An important metric of particle 
diversity, known as effective sampling size (ESS), is shown in equation 3.3 [72], [73], [66]: 


ESS = 


1 


z£i K ) 2 


(3.3) 


where % is the normalized weight of each particle. Instead of changing the particle 
distribution during every iteration, the distribution change only needs to occur when ESS < 
/3N, where [3 € (0,1) (typically 0.5) [72] and N is the total number of particles. By choosing 
only specified times to resample, the particles maintain more diversity and therefore are 
more likely to converge on the correct position of the UV. 

There are many ways to conduct resampling [74]. Commonly used techniques within 
resampling include multinominal, residual, stratified, systematic, and a parallel resampling 
method called Metropolis [73], [74]. However, each of these re-sampling techniques leads 
to higher computational complexity [73], [74]. Within ATAN, there is a finite amount of 
time and computational cost allowed such that real-time decisions can be made. Therefore, 
by increasing the cost of the PF, the cost of another ATAN process (shown in Figure 1.1) 
must be reduced. By resampling only when ESS < f3N using the resampling methodology 
shown in Algorithm 1, the PF requires fewer computations, but also maintains a positional 
uncertainty measurement of less than one meter, as discussed in Section 3.3, which is 
acceptable for this thesis. 


3.2.4 Multimodal Distribution 

The PF in Algorithm 2 is solely focused around one Gaussian distribution, but can be 
improved with minimal computational cost by using multimodal distributions in which 
multiple modes are used to increase PF robustness. Multi-modal distributions also give 
greater particle diversity which minimizes the chance of resampling [75], and further, the 
loss of multimodality [66]. Recent PF techniques utilize multimodal distributions through 
cluster-based distributions [66], [76], but multimodal distributions come with two issues 
that need to be solved in order to produce an effective filter: (1) choosing and implementing 
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each mode and (2) combining modes, if practicable [77]. 

In this thesis, the multimodal model takes a simple approach by taking the top ten weighted 
particles, 

n$ = arg max w n , t -\ (3.4) 

72 SN 

in which # 6 Z {§ = 1,2,3... 10 for this thesis) to be used as the modes for the next 
distribution. The distribution of particles around particle is determined through the 
weighted mean distances from particle no to the other top particles, 


Z «7-i d n 
Z w,- 


Lastly, the probability of each particle occurring within a distribution is proportional to the 
weight of the particle, 


p(&n\w t -i) = a ' 1 —. (3.6) 

Z t -1 

Using equations 3.4-3.6, the multimodal distribution can be employed. 

3.2.5 Final Model 

Combining mitigations for particle kidnapping and ESS, and adding in the multi-modal 
distribution culminate in the final MCL algorithm used for ATAN, as shown in Algorithm 
3. While Lines 1-11 are the same as in Algorithm 2 [67], particle distribution changes 
(Lines 12-22) only occurs when the ESS is less than j3N. 
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Algorithm 3: Final MCL 


1 Final MCL(B t -i,a t ,Zt,n) 

2 Static W s Jowi IV fast 

3 B t = B t = 0 

4 for n = 1 to N do 

5 = motion model(a f , b ^ } ) 

6 w\"^ = sensor modelfe, b\ n \n) 

i B t = B t + < b [ "\w\ n] > 

, l M 

8 W av g — W a vg + pjW t 

9 end 

10 W s / ow — W s / ow + (Xsi ow (lV av g — W s l ow ) 

11 IVfast = Wfast + ® fasti^Vavg ~ IVfast) 

12 if ESS < f 3 N then 

13 for n = 1 to N do 

14 p e E([0, 1] ; 

is if max( 0 . 0,1 - Wf ast /w s i 0 w ) < <P then 

16 | add random pose to B t 

17 else 

is draw i from multi-modal sample N(n$, cr 2 ) 

19 add b' t to B r 

20 end 

21 end 

22 else 

23 draw i from multi-modal sample N{n$, cr 2 , ) 

24 add b\ to B, 

25 end 

26 return B t 





3.3 Particle Filter Implementation and Analysis 

The Algorithm 3 PF was implemented and analyzed using a full, accurate a priori map 
through a map model, motion model, sensor model, and establishing PF parameters with a 
maximum AUV uncertainty equal to one meter. 

3.3.1 Map 

A 255m x 255m bathymetry map (Figure 3.3) provided by California State University, 
Monterey Bay (CSUMB) [78] was validated using the REMUS prior to the simulation. 



Figure 3.3. Sea Floor Bathymetry Provided by CSUMB, Verified by REMUS. 
Adapted from [78]. 


Shown in Figure 3.3, the one meter resolution map area was chosen due to its combination 
of flat, rough, and changing terrain that also was in the required depth range of the REMUS. 
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3.3.2 Motion Model 

The simulated motion model utilized the same model for vehicle position updates, but also 
incorporated a noise aspect. The actual vehicle motion model uses a Kalman Filter, 

x = Av + xo (3.7) 

in which A is a transition matrix, v is the AUV velocity vector, and xo is the original position 
vector. The PF motion model added two noise aspects to equation 3.7 to account for position 
error and velocity error, with noise power equal to 0.2 and 0.3, respectively. 

3.3.3 Sensor System Modeling 

The REMUS vehicle has two predominant sensor systems: BlueView MB2250 downward 
looking sonar and ARC Scout side-scan sonar. The BlueView MB2250 sonar sensor gives 
an image space, linear twenty meter swath, directly under and perpendicular to the vehicle. 
The ARC Scout side-scan sonar also outputs in image space, with a range of thirty feet [17]. 
BlueView MB2250 sonar sensor and ARC Scout side-scan sonar image spaces are shown 
in Figure 3.4. 



Figure 3.4. Sensor System Model, (left) Image Space BlueView MB2250. 
(right) Image Space ARC Scout Side-Scan Sonar. Source: [13]. 


Through computer vision, both the BlueView MB2250 image and side-scan sonar image can 
be transformed into a spatial grid, discussed in future work not completed with this thesis. 
In this thesis we utilize a simplification of the sensor system that emulates the combination 
of both sensors as a ten-meter circular swath around the vehicle, as shown in Figure 3.5. 
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Figure 3.5. Sensor Model, (a) Top-down view of REMUS and sensor of 10m 
radius, (b) View of 10m sensor model over terrain. 
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The weighting of each particle can be determined through the correlation method [26]. 
From the full, accurate map, the sensor model uses the correlation between the particle 
observation, with noise power equal to 0.2, and the AUV sensor, shown in equation 3.8: 

\n\ E[(s p - m p ){s a - m a )\ 

W. — - (J.o) 

cr s cr s 

in which the particle sensor matrix s p ,particle sensor mean m p , AUV sensor matrix s a , AUV 
sensor mean m a , particle sensor standard deviation, cr Sp and AUV sensor standard deviation 
cr s are known or can be calculated. 

3.3.4 Particle Filter Parameters 

The PF in Algorithm 3 was implemented in MATLAB for position estimation of the AUV 
used in this thesis over a 2000 meter and 8000 meter trajectory to gauge accuracy and also 
evaluate recovery from kidnapping using the constants shown in Table 3.1. 


Table 3.1. Particle Filter Constants used in Implementation. 


PF Constants 

N 

1000 

£ 

0.5 

Wslow 

0.5 

Onslow 

0.1 

W fast 

0.5 

@fast 

0.9 


Over a short term trajectory length of 2000m (Figure 3.6) and long term trajectory length 
of 8000m (Figure 3.7), the PF maintains an error of less than one meter. Based on the 
resolution of the map, the simulation the PF is able to localize the AUV position to within 
a single 1 m x lm cell. 
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Figure 3.6. Particle Filter Performance Over 2000m Trajectory Length Using 
Algorithm 3. (a) AUV trajectory in red, PF trajectory in black (completely 
overlapped by red trajectory) shows small degree of error throughout the 
AUV motion, (b) PF short term error in meters along trajectory. 
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Figure 3.7. Particle Filter Performance Over 8000m Trajectory Length Using 
Algorithm 3. (a) AUV trajectory in red, PF trajectory in black (completely 
overlapped by red trajectory) shows small degree of error throughout the 
AUV motion, (b) PF long term error in meters along trajectory. 
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The PF was also tested in regards to particle kidnapping. The estimated position of the 
AUV was moved to a random location once the AUV traveled 2000 meters, as shown in 
Figure 3.8, to simulate the response of the PF converging on an incorrect position and to 
determine the recovery of the PF. The PF error before, during, and after kidnapping is 
shown in Figures 3.9-3.11. 
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Figure 3.8. Particle Filter Response to Particle Kidnapping: (red) AUV 
Trajectory (black) PF Position Estimate. 
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BEFORE 



Figure 3.9. Particle Filter Error Prior to Particle Kidnapping. 



Figure 3.10. Particle Filter Error Response and Recovery to Particle Kidnap¬ 
ping. 
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Figure 3.11. Particle Filter Error After Recovery from Particle Kidnapping. 


As seen in Figures 3.9-3.11, the PF error increases drastically to « 140 meters (Figure 3.10) 
when the AUV is moved at a path length of 2000m. However, the PF recovers to within 1.4 
meters of positional error after the AUV travels 50 meters (Figure 3.11), and within 1 meter 
of positional error after around 230 meters of distance traveled, successfully converging on 
the correct AUV position. 

3.4 Particle Filter Conclusion 

The AUV position estimator, the PF, is able to localize the position of the AUV to within 
±lm over an 8000m trajectory using an a priori map. For comparison, the INS discussed 
in Table 1.1, would generate an error of 40m with the same trajectory. The high success 
of the PF can be attributed to the simplifications made to the sensor system through the 
employment of a circular sensor. However, PF has been proven to maintain one meter of 
error in real time within TAN, shown in [33], giving the assumption results credibility. 

The PF increases the ability to estimate position by maintaining lm or less of positional 
uncertainty over an 8000m trajectory which achieves the goal of minimizing positional error 
listed in Section 1.1. But the PF can still be improved. Unfortunately, the PF is only as 
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good as the map that the AUV creates. The map created and used has a resolution of lm 
which makes it difficult for the AUV to localize, and maintain localized, to an error of less 
than 1 meter. However, the PF is easily scaleable, and therefore the sensor accuracy will be 
the limiting factor as opposed to the map accuracy when making a map (not in simulation). 

Lastly, the outstanding PF performance is due to two factors. First, the amount of information 
within the map (rough, uneven bathymetry) contributes to the PF easily determining the 
AUV position (with noise sensor noise having a minimal affect). Second, the PF success 
is due to the a priori map, which during ATAN the PF will no longer have. Instead, the 
input to the PF will only be the AUV generated map, which comes full circle back to the 
exploration-exploitation dilemma. 
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CHAPTER 4: 
Exploration 


Many machine learning techniques use the following template: sample from a distribution, 
derive multiple IID random variables, solve an optimization problem to build a model, and 
make predictions based on the model. This chapter builds, maintains, and utilizes an optimal 
spatial estimate model of a random field, mathematically depicted as a Gaussian Process 
Model (GPM), to conduct exploration planning. The GPM provides a mean estimate 
and covariance estimate, the later of which represents the uncertainty measure quantifying 
exploration. Through the GPM, exploration can be defined as the reduction of uncertainty 
in the GPM by either exploring new, unmapped areas or by decreasing the uncertainty in 
high uncertainty areas through re-visitation. This chapter discusses the GPM and describes 
how the GPM can be used for path planning in a real-time process. 

4.1 Gaussian Process Model 

A GP is defined by Rasmussen and Williams as multiple Gaussian random variables [79]. 
For modeling the terrain, a bounded rectangular N x M region is subdivided into n x m 
cells where each cell m can be fully described as a Gaussian distribution with a mean p and 
a covariance k(m,nT), ~ N(p, k(m,nT)). This distribution is based on the center of the cell 
and is known as the point of interest (POI) of the cell. 

In order to employ a GPM as a spatial estimator of terrain, three assumptions and two 
simplifications need to be made. 

Assumptions 

The following assumptions are associated with using the GPM: 

• There exists a function that describes the exact terrain. To minimize computational 
complexity, linear interpolation is used , z = wx + e, in which z is the observed target 
value, w is a vector of weighted parameters, x is the location, and e is IID additive 
noise with ~ N(0,cr“)) [79]. 
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• All cells within the bounded region adhere to a second-order stationarity assumption: 
the mean and covariance of each individual cell do not vary significantly over time 
and space, and the covariance between each cell is only a function of the relative 
distance between two cells. 

• The covariance relationship with distance can be described as isotropic—uniform 
in all directions—with the covariance inversely proportional to distance. As the 
distance increases, the similarities within the terrain tend to decrease, leading to a 
lower covariance. 

Simplifications 

The following simplifications are associated with using the GPM: 

• The information gain is a function of the forward looking sensor only on the direct 
path of the vehicle when calculating a potential trajectory for exploration. This 
constraint reduces the computational complexity of the GPM by not determining the 
information gain over every cell touched by the sensor but rather only those which 
the UV drives directly over. 

• The predictive distribution (mean prediction and covariance) of the GPM are cal¬ 
culated for one test point on the UV path, using a random sampling of nearby 
surrounding points. By only having one test point, the computational complexity of 
the GPM decreases, and the mean and covariance of the cells can still be adequately 
modeled. 

Employing the GPM 

Taking a function-space view approach, the Gaussian process can be defined as [79] 

/(m) ~ QP(n(m), k( m,m')), (4.1) 

in which the mean function ju(m) and covariance function k(m,nT) can be expressed 

using [79] 


/r(m) = E[/(m)] 

k{ m,m') = E[(/(m) - //(m))(/(m) - //(:m'))]. 


(4.2) 
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The GPM can be written through a number of input points M as [79], 


z 

= N 



K(M,M) + cr 2 I 

f* 


li* 




K(M,M*) ]\ 

J) 


(4.3) 


in which the n explored area observations z are represented through a mean estimate /j 
and a 77 x n covariance matrix K(M,M) [79]. The 77 * unexplored area observations f* are 
represented with a mean estimate //* and 77* x 77* covariance matrix K(M f , M*) [79]. The n 
x 77* covariance matrix K(M, M*) represents the n explored estimates and the 77* unexplored 
areas [79]. 


4.1.1 The Matern Covariance Function 

The covariance matrix can be calculated from commonly used models such as spherical, 
squared exponential, Matern, or any other covariance function such that the covariance 
matrix is positive semi-definite [80]. In this thesis, the Matern model is used. It is generally 
considered to be the best choice for representing spatial differences between two points [81], 


K{M,M*) 


2 1 ~ v V2v|M - M* 

W)\ 4 


Ky 


'V2v|M -mA 

1 ) 


(4.4) 


in which v and A are non-negative parameters, K v is the modified Bessel function of the 
second kind, and T(v) is the Gamma function. 


4.1.2 Variogram 

The parameters, v and A, used for the Matern covariance function (equation 4.4) can be 
determined through the use of a semi-variogram, y(m, h) or variogram, 2y(m, h ), [80] 

2y(m, h) = E[f( m) - /(m + h)] 2 , (4.5) 

in which m is the position in the map and h is the distance from location m. The variogram 
and covariance functions are conceptually related, with the variogram representing the 
dissimilarities of the terrain and the covariance depicting the similarities of the terrain. 
Further, both the covariance and variogram maintain an underlying hypothesis that the 
variogram and the covariance function only depend on the distance h between two points, 
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which allows for the covariance C(h) and variogram 2y(h) to be expressed as only a function 
of distance h. The covariance-variogram relationship is further described in Appendix A, 
with the resulting mathematical relationship between the two represented as [80] 

C{h) = y(oo) - y(h). (4.6) 


The variogram, shown in Figure 4.1, can be defined through the properties of sill, range, 
and nugget, 



Lag (meters) 

Figure 4.1. Variogram Terminology: Range, Sill, and Nugget. Source: [82], 


in which the sill is the maximum value of the variogram plateau [82], the nugget describes 
the measure of dissimilarities at close distances (located at the y-intercept) [82], and the 
range is the distance between the location of the nugget and the location of the start of the 
sill [82], 

The centralized hypothesis of the variogram—the variogram only depends on the separation 
between two points—allows for the creation of the variogram estimator [80], 

N(h) 

2y'(h) = — £ C/CmO “ /(■* + A)] 2 . (4.7) 

in which N(h) is the number of data points between /(m,) and /(m ; + h). 
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Applying the variogram methodology to the bathymetry data used in this thesis, the ini¬ 
tial spatial relationship between depth and distance is determined. Using equation 4.7 a 
variogram is created, as shown in Figure 4.2. 



Figure 4.2. Variogram Estimated from Experimental Data using 100 Parti¬ 
tions. 


4.1.3 Fitting Matern Covariance Function to the Variogram 

The Matern covariance function, equation 4.4, has two non-negative parameters: v and 
A. Using the relationship between the variogram and covariance, equation 4.6, and the 
variogram, Figure 4.2 (right), the covariance as a function of distance can be calculated and 
fit using the the Matern function through a least squares method, producing Figure 4.3. 
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Figure 4.3. Fitting the Variogram using the Matern Function, (red) Vari- 
ogram, (blue) Matern Function. 


4.1.4 Lagrange Multiplier 

With the Matern covariance function, the predictive distribution has a mean and a covariance 
which can be written as [79] 

/* = K{M*,M)[K{M,M) + o-l I]~ l z (4.8) 

o-; = K(M*,M*) - K(M*,M)[K(M,M) + a (4.9) 

To ensure a minimum bias, a Lagrange multiplier can be used [80] to subtract out the local 
mean from f*. To simplify equations 4.8 and 4.9, a short form will be used such that 
K(M*,M) = [K*\, K(M*,M*) = [*„*], K(M,M) = [K], and [A] 7 = K(M*,M)[K(M,M) + 
cr,;/] -1 . Equations 4.8 and 4.9 can be rewritten in the matrix form [80], 

./* = [A] T [z] (4.10) 

o* = [K»\-[A] T [K.l (4.11) 
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in which [z] is the sampled depths, and [d] 7 is the transpose of the weighting matrix [dj. 

For this thesis, the GPM uses a simplification to have one test point calculated from 
surrounding points. The system of equations for spatial optimization with the Lagrange 
multiplier kl can be written as [80], 


[K][A] = [K.], 


in which 


[K\ 


C(M\,M\) ... C(M\,M n ) 1 
: : 1 
C(M n ,Mi ) ••• C(M n ,M n ) 1 

1110 


fit 




An 

-kl 


and 


[K*] 


C(M U M *) 

C(M n ,M*) 

1 


(4.12) 


(4.13) 


(4.14) 


(4.15) 


in which ) is the covariance determined from Figure 4.3 based on the distance 

between M i and M 2 . 

Solving the spatial optimization (equation 4.12) [80], 

[d] = [iq-to (4.16) 
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can then be used to solve for the predictive distribution mean and covariance, reiterated 
below [79]: 

/* = [X\ T [z] (4.17) 

a1 = [K„]-[A] T lK t ]. (4.18) 

4.2 Constructing A Real-Time Gaussian Process Model 

Normally the covariance function is calculated through a batch process. In this thesis, the 
covariance determination occurs when the terrain is learned through a Bayesian update [83]: 
using the Bayesian prior and the current observations to calculate an updated mean and 
covariance. 

4.2.1 Establishing the Prior Matern Covariance Function 

The prior Matern covariance function uses the isotropic assumption to show the similarity 
of the terrain as a function of distance, regardless of direction. Through second-order 
stationarity, the covariance between two closer areas gives higher similarities, while further 
distances provide lower covariance values. The covariance relationship decreases, with 
negative concavity, as a function of distance [83]. Therefore, the initial Matern covariance 
function can be described through 


z = C(0) - rh 2 (4.19) 

in which C(0) is the initial covariance, h is the distance between two measurements, 
r = C(0)/r 2 , and r is the range. 

Additionally, the initial Matern covariance function is expected to provide an inadequate fit, 
with the fit being continuously improved as exploration occurs. 

4.2.2 Update Matern Covariance Function 

As exploration occurs, the Matern covariance function is adjusted using the known regions 
of the GPM. Using a Markov Chain Monte Carlo (MCMC) approach, the posterior semi- 
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variogram, f, can be determined [84], 


7(h) = 


-W Z 7v(h), 


V -W 


v=W +1 


(4.20) 


in which y v (h) are prior semi-variograms, V is the length of the Markov Chain, and W is 
the amount of initial iterations to be discarded [84]. 


Using the variogram-covariance relationship, equation 4.6, converts equation 4.20 to 


C(h ) = f(oo) - f(h). 


(4.21) 


a posterior covariance function—computed only using explored areas—that can be modeled 
using the Matern covariance function. As exploration occurs, the Matern covariance func¬ 
tion becomes more refined, allowing for better mean and covariance predictive distributions 
(equations 4.17 and 4.18) to be calculated in real time. 


4.2.3 GPM Update 

As exploration occurs, the GPM is updated through the predictive distributions (mean 
and covariance), which change as the AUV path length increases and exploration occurs. 
The initial zero mean, constant covariance model (in this thesis we use an initial constant 
covariance equal to four) is updated to the predicted mean and covariance, as illustrated in 
Figure 4.4 and Figure 4.5. 
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Figure 4.4. Gaussian Process Model Predictive Mean Updates. 


The mean model (Figure 4.4) starts with all areas equal to zero. As the vehicle discovers 
new areas, each explored area is updated through the predictive mean (equation 4.17). The 
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vehicle track is shown as a dashed line, with unexplored areas shown in purple and explored 
areas shown in shades of yellow. Additionally, as the vehicle explores more area (Figure 
4.4b), the mean map more resembles the true bathymetry of the area. 
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Figure 4.5. Gaussian Process Model Predictive Covariance Updates. 


The covariance model (Figure 4.5) is initialized to a high uncertainty of the bathymetry, 
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~ N(4, cry). As the vehicle explores or re-visits areas, the uncertainty of areas visited can be 
reduced using the predictive covariance (equation 4.9), shown in Figure 4.5a. The vehicle 
track is shown in a black dashed line, and areas of vehicle path intersection results in lower 
uncertainty (covariance) of the bathymetry. As more of the map is explored (Figure 4.5b), 
the uncertainty about the map decreases. 

As the AUV explores more area, the GPM will provide a more accurate representation of 
the terrain, further resulting in a better ability to conduct exploration. 

4.3 Kullback Leibler Divergence 

The Kullback Leibler Divergence (KLD) computes the difference of mutual information 
between two probability distributions [85]. In this thesis, the KLD is used to determine the 
exploration score by calculating the information gain for each possible trajectory. Using the 
GPM mean and covariance predictions for each cell the UV drives over, the KLD can then 
evaluate each trajectory to determine the best course. The KLD between two distributions, 
p(m) and q(m), is defined as can be computed using equation 4.22 [85]: 

KL(p\\q) = J p{ m)log(^^)dm. (4.22) 

The KLD can be simplified by expressing both p(m) and c/(m) as Gaussian distributions: 
p( m) with mean p\ and variance rrj 2 and g(m) with mean pn and variance cry. Simplifying 
equation 4.22, as described in Appendix B, yields equation 4.23: 

/C£.(p|| g ) = log(—) + +(/ “ 2 ~ /l2) (4.23) 

\<r 1 / 2 cry 2 

The KLD is normalized to ensure an exploration score per time-step 6 [0,1], 

KL Norm (p\\q) = 1 - e- KL WM (4.24) 

which is then averaged over the entire trajectory to give the combined exploration score. 
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4.4 Exploration Analysis 

The exploration analysis is broken into two parts: evaluating the GPM at various stages of 
the UV path and evaluating future trajectories to ensure exploration is maximized. 


4.4.1 Evaluating the GPM 

As the AUV conducts exploration , the GPM converges to the true terrain, as seen in Figure 
4.4 and Figure 4.5. The AUV begins with a zero-mean, constant covariance model over 
the entire area and updates the sensor area with the prediction mean and covariance as 
exploration occurs. The reduction in spatial covariances can be seen in Figure 4.6, where 
the lowest covariances correspond to areas of intersection with the previous path of the 
AUV. As the AUV path length increases from Figure 4.6 to 4.7 to 4.8, the covariance of 
the map decreases. 
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Figure 4.6. Top-Down Covariance GPM Prediction and AUV path (black 
dashed line) for AUV Path Length of 2000m. 
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Figure 4.7. Top-Down Covariance GPM Prediction and AUV path (black 
dashed line) for AUV Path Length of 4000m. 
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Figure 4.8. Top-Down Covariance GPM Prediction and AUV path (black 
dashed line) for AUV Path Length of 6000m. 


As exploration occurs, the GPM changes to more resemble the true bathymetry of the map. 
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Looking horizontally at the GPM, depicted in Figure 4.9, shows the mean and covariance 
of the GPM along the red trajectory shown in Figures 4.6-4.8. 
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The initial AUV GPM, 2000m GPM, 4000m GPM, and 6000m GPM show an update to the 
true terrain mean and a decreasing covariance trend, corresponding to exploration occurring 
and map uncertainty decreasing. 

4.4.2 Evaluating Future Trajectories 

The GPM contributes to future trajectory evaluation by providing an exploration score. The 
highest exploration score parallels the highest information gain, whether that be exploring 
new areas or reducing the map uncertainty by re-covering previously explored areas. Look¬ 
ing at Figure 4.10, the purple constitutes uncharted regions, various yellow shades represent 
charted areas corresponding to depth, and the red trajectory shows the best trajectory for 
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exploration. The other trajectories not equivalent to the maximum exploration score are 
shown in green, yellow, and blue which correspond to exploration scores within the top 
third, middle third, and bottom third, respectively. All trajectories leaving the desired map 
region were not plotted for clarity. 
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(a) Trajectory 26. 


(b) Trajectory 27. 
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Figure 4.10. Trajectory Evaluation for Trajectories 26-29 to Maximize Explo¬ 
ration. AUV only knows explored regions in yellow and unknown unexplored 
regions in purple. Red trajectory shows maximum exploration and green, yel¬ 
low,and black show upper third, middle third, and bottom third exploration 
scores, respectively. 


In trajectories 26-29 of Figure 4.10, the exploration component attempts to explore new 
areas and shows lower exploration scores for previously explored areas or areas with low 
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map uncertainty. In trajectories 27 and 28, the explored region has high map uncertainty, 
due to the AUV exploring other regions first, which leads to the AUV choosing to enter a 
previously explored region slightly earlier than other trajectories. Therefore, the maximum 
exploration trajectory can be validated as the trajectory with the highest information gain 
through either exploring new areas or reducing the map uncertainty. 

4.5 Exploration Conclusions 

The GPM allows for each individual cell of a larger map be quantified by a mean and a 
covariance. Using the GPM, exploration can be quantified as the information gain over a 
trajectory. Through exploration , the Matern covariance function becomes more refined and 
the map uncertainty is lowered through either discovery of new areas or uncertainty reduction 
in high uncertainty areas. Exploration can be maximized through the employment of the 
KLD, which can compare each trajectory using the GPM mean and covariance. Lastly, with 
the constant incentive of exploration , the UV will provide complete coverage of a defined, 
bounded region which accomplishes the complete coverage criteria in Section 1.1. 
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CHAPTER 5: 
Exploitation 


Exploitation is the use of previously determined distinguishable terrain to localize the 
position of the UV, reducing the UV positional uncertainty to that of the terrain uncertainty. 
With lower positional uncertainty, the UV measurements reduces the original error and can 
more accurately map the terrain. 

It is also important to distinguish the difference between position estimation discussed in 
Chapter 3, and exploitation discussed in this chapter. Position estimation, by means of a 
particle filter (PF), is used to minimize positional uncertainty of the UV based on errors 
introduced through sensors, modeling, and control actions, whereas exploitation uses a 
previously discovered terrain to localize position by loop closure, anchoring the positional 
uncertainty to a more recent UV position. 

This section discusses, derives, and implements the exploitation component of the UV, 
which can be broken down into four parts: classifying the terrain using Boltzmann entropy, 
defining requirements for localization, introducing the Maximum A Posteriori (MAP) esti¬ 
mate, and applying the MAP estimate. 

5.1 Boltzmann Entropy 

The landscape ecology community recently classified terrain using the second law of ther¬ 
modynamics [60], [61]. Configurational entropy, specifically Boltzmann entropy, represents 
the spatial unpredictability of the terrain [60], [61] and therefore quantifies the amount of 
information within the topology. The proportional variability of the ocean floor directly 
correlates to the amount of information that a sensor can detect, and further, can use to 
locali z e the position of the UV. Boltzmann entropy computes the number of permutations 
of a sample based on two assumptions: the sample is time independent and there is no prior 
knowledge of the sampled area. For most terrain mapping applications, both assumptions 
can be met. For time varying systems (e.g., sand ripples or dredging operations), the as¬ 
sumption can still apply for a small enough time interval in which the sampling occurs. The 
Boltzmann Entropy can be defined as S = k/,log(VF) where kb is the Boltzmann constant 
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and W is the number of permutations within the sampled state. In this thesis, the Boltzmann 
Entropy is proportional to the number of permutations, and will be defined solely through 
its number of permutations 

£(m„) oc log(W), (5.1) 

in which (r(m n ) is the Boltzmann entropy at position m„ and W is the number of permuta¬ 
tions. 


5.2 Requirements for Localization 

Through a Boltzmann entropy representation, terrain can be classified following the ap¬ 
proach derived in [78] in which the Boltzmann entropy corresponds to the terrain uncer¬ 
tainty. A high Boltzmann entropy (high terrain uncertainty) translates to terrain features 
that have a strong potential to localize the position of the vehicle through the correlation 
method. Referring to Figure 5.1, the Boltzmann entropy quantifies the topology (5.1a) with 
large values corresponding to terrain features shown in the contour map (5.1b). With all 
sensors described in Section 1.6, the sensor accuracy is significantly better than that of the 
estimated map. Therefore, the localization is dependent, and limited by, the map resolution 
and is not reliant on any onboard systems. 
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Figure 5.1. (a) Boltzmann Entropy Map (S)and (b) Ocean Floor Bathymetry 
adapted from [78]. 
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Due to the small amount of data available, two assumptions are needed in order to statistically 
evaluate the data and apply the model to any type of terrain, allowing the vehicle to operate 
anywhere as opposed to just one 255m x 255m area. The first assumption is that the 
255m x 255m meter area is ergodic: the data contains all the properties of the larger area 
(specifically the ocean) and can therefore sufficiently model all areas of terrain. The second 
assumption is that the terrain can be modeled as a multi-variate Gaussian distribution—that 
a smaller sample of the map (in this case a lmx lm meter cell) can be represented fully 
through a entropy mean and variance within the cell. 

The data from Figure 5.1 is tabulated, as represented by the box plot in Figure 5.2, which 
shows the high quantity of major outliers in the Boltzmann entropy map (S). Furthermore, 
each major outlier is significantly different, which allows the UV to use these high Boltzmann 
Entropy values to localize its position. 
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Boxplot of Entropy Values 



(a) 

Depth, [m] 



(b) 

Figure 5.2. Statistical Modelling and Visual Representation of Boltzmann 
Entropy, (a) Box plot Boltzmann entropy values from Figure 5.1. (b) Vi¬ 
sual representation of a portion of bathymetry map with highlighted terrain 
corresponding to a max entropy value of 300. 
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A Boltzmann entropy value of £(m) = 300 can be seen visually in Figure 5.2b, showing 
asymmetrical terrain. No matter the orientation of the sensor used, the orientation of the 
UV can be determined based on the features position on the generated map. 

To further demonstrate changes in Boltzmann entropy, Figure 5.3 and Figure 5.4 depict 
two drastically different areas with different maximum Boltzmann entropy (£(m) = 40 for 
Figure 5.3 and £(m) = 2160 for Figure 5.4). 

Referring to Figure 5.3 and Figure 5.4, its very clear that a higher Boltzmann entropy 
corresponds to a better probability of localization. Furthermore, Figures 5.2, 5.3, and 5.4 
show that at low entropy values there is a low probability of localization (fiat terrain) but 
once Boltzmann entropy reaches a certain magnitude, the localization probability plateaus 
due to the asymmetrical terrain which lends the likelihood of localization, P(X*|£(m„)), to 
be modeled as a cumulative distribution function (CDF). 
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Figure 5.3. Boltzmann Entropy Visual Representation max(£(m)) = 40. (a) 
Terrain Depth Profile, (b) Boltzmann Entropy Profile (above - green) and 
Terrain Depth Profile (below - grey). Sensor regions of Boltzmann entropy 
and terrain highlighted in black and pink, respectively. 
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Figure 5.4. Boltzmann Entropy Visual Representation max(£(m)) = 
2160.(a) Terrain Depth Profile, (b) Boltzmann Entropy Profile (above - 
green) and Terrain Depth Profile (below - grey). Sensor regions of Boltz¬ 
mann entropy and terrain highlighted in black and pink, respectively. 
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The data from Figure 5.2 is heavily skewed to the flat, smooth portions of the map (Boltz¬ 
mann entropy equal to 1). Localization requires asymmetrical, uncertain terrain and there¬ 
fore the flat areas were removed prior to modeling the probability of localization. Without 
the flat, asymmetrical areas, the Boltzmann entropy mean and standard deviation of the 
data from Figure 5.2 are //^>i = 83.0 and cr^>i = 74.9, respectively. With a specified mean 
and standard deviation of the Boltzmann entropy representation of the terrain omitting flat 
areas, the probability of localization, given the Boltzmann entropy at state m„ 

can be given through the Gaussian CDF, 


P(X* 


1 

2 


1 + erf 


, < 7 £> 1V2 


(5.2) 


with select inputs and outputs of the CDF function shown in Table 5.1. 


Table 5.1. Boltzmann Entropy and Relative Localization Probability. 


Boltzmann Entropy 

£(m n) 

Localization Probability 

P(X* |£(m„)) 

1 

13.7 % 

50 

32.9 % 

100 

59.0 % 

150 

81.4% 

200 

94.0 % 

300 

99.8 % 


Referring to Table 5.1 and Figures 5.2 - 5.4, a Boltzmann entropy value of £(m„) = 1 
corresponds with a low localization probability of 13.7%. Flat terrain still gives a chance of 
localization, but the most important aspect of P(X*|£(m„)) for £(m„) > 300 corresponds to 
about 100% localization, which matches with the asymmetrical terrain. Further examples 
of terrain and their respective Boltzmann entropy values are shown in Appendix C. 
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5.3 Maximum A Posteriori Estimate 

Taking a deterministic view of the Boltzmann map, a MAP estimate utilizes the most 
expected hypothesis to make a prediction [86], [85]. Within ATAN, the most expected 
hypothesis corresponds to the vehicle position estimate using the best correlation between 
the vehicle and the terrain, as discussed in Chapter 3. Through the position estimate, 
multiple trajectories can be generated, as shown in Figure 5.5. 
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Figure 5.5. Possible Trajectories for AUV Shown in Red. 


One disadvantage of the MAP estimate is the lack of an uncertainty measurement within 
the calculation [85], specifically the uncertainty with the generated Boltzmann entropy map 
(S). However, the lack of uncertainty within the generated Boltzmann entropy map (S) is 
small since the particle filter is accurate to within one meter and the vehicle map resolution 
is one meter. Therefore, in this thesis we utilize a deterministic approach to the Boltzmann 
entropy estimation and leave the stochastic system representation of Boltzmann entropy for 
future work. 

Through a deterministic approach, Boltzmann entropy and the posterior distribution of 
explored areas can lead to each trajectory (Figure 5.5) being quantified with a MAP estimate 
to give an exploitation score. The optimal exploitation trajectory 6* over all trajectories 0 
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can be written as 


6* = arg min ( - log P(A*|6()|, (5.3) 

in which P(X*\0j) specifies the localization probability along trajectory 6j consisting of n 
IID (second order stationarity) points m, n . Equation 5.3 can be written as a function of 
m (> n through averaging [87], 


P(X*\e i ) = ^Y J P(X*\m i/l ), (5.4) 

n= 1 

in which N is the number of discretized points along trajectory 6j. Averaging does provide 
drawbacks for the resulting distribution mean and variance if the combined distributions 
are dissimilar [87]; however, in this instance all probability distributions P(A*|m,„) are 
identical as each is derived from equation 5.2. Each state m ; n can be quantified through 
Boltzmann entropy such that equation 5.4 can be rewritten as 

1 N 

P(X*\6i)=-J]p(Z(mn)). (5.5) 

n— 1 


The likelihood of localization over trajectory 8j depends on whether the terrain has been 
explored and the Boltzmann entropy of the terrain. Terrain that has not been explored 
provides no point of reference to the AUV and therefore, no chance of localization. The 
Boltzmann entropy provides a heterogeneity measure of the terrain which can be used 
to determine the best trajectory of the AUV. However, when a full trajectory covers a 
previously explored region, areas of localization (high entropy) have the possibility of 
being marginalized by lower entropy areas on the trajectory. Additionally, if a trajectory 
has high entropy over a majority of the trajectory, that trajectory should be looked at more 
favorably than a trajectory with a singular high entropy position. Therefore, in this work 
we propose the following weighted average approach 


P(X*\0i) 


Z„= 1 g(mf,n) Ajfin rj) 


Z" ! *(m i,n) 


o. 


if ZjL 1 e(m ; >) > 0 

otherwise. 


(5.6) 
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in which 


e(m,i) 


£ X ( m n) , 
0 , 


if (m„) explored 
otherwise. 


(5.7) 


Multiple values of x (equation 5.7) were investigated and we selected x = 2 as it led 
to the best results for prioritizing large individual entropy values but also giving overall 
precedence to multiple high entropy values (£ > 300) over an entire trajectory. 

Applying equation 5.6 to the trajectories generated in Figure 5.5 yields Figure 5.6 in which 
6* is shown in red with above and below average exploitation scores shown in green and 
yellow, respectively. 
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Figure 5.6. Exploitation Score Trajectory Analysis. The best trajectory (red) 
which corresponds to highest MAP estimate with above (green) and below 
average (yellow) exploitation scores shown in green and yellow, respectively, 
(a) Trajectory analysis top-down view over Boltzmann entropy map (H). (b) 
Trajectory analysis projected onto a 3D entropy map. 
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Figure 5.7. Exploitation Score Trajectory Analysis Using Bathymetry. Four 
horizons consisting of 50m trajectories utilizing heading rates of [-2,-1,0,1,2] 
degrees/second. The best trajectory ( 6 *) is shown in red which corresponds 
to highest MAP estimate with all other exploitation scores shown in green. 
Trajectory analysis over terrain map. 


Results show the entropy classification and MAP estimate can be shown effective, as the 
best MAP trajectory (red) encounters the most irregular terrain, as illustrated in Figure 5.7. 


5.4 Application of the Maximum A Posteriori Estimate 

Within ATAN, the UV will not have an a priori map, which was used in the derivation and 
implementation of the MAP. This section replaces the fully observable prior map used in 
Section 5.3 with the map generated through the exploration of the UV. Furthermore, the 
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mean and variance inputs to determine the probability of localization (equation 5.2) are 
based off of only previously explored areas. The resulting MAP estimate is shown in Figure 
5.8. 
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(c) Trajectory 14. 


(d) Trajectory 15. 


Figure 5.8. Trajectory Evaluation over Trajectories 12-15 to Maximize MAP 
with No A Priori Map. AUV only knows explored regions (yellow) with 
unknown regions represented in purple. The maximum MAP is shown in 
red, with above average and below average MAP shown in green and yellow, 
respectively. 


Each projected 150m path in Figure 5.8 shows the optimal path (red) with the highest MAP 
score. In trajectories 12-13, the MAP is fighting to get back to previously explored areas, 
while in trajectories 14-15, the best trajectory goes over higher entropy areas (jagged terrain) 
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which allow the AUV to localize. As more of the map is explored, the MAP will have more 
trajectories to choose from, resulting in higher exploitation scores. 


5.5 Exploitation Conclusions 

The exploitation score is calculated using projected trajectories of the UV coupled with the 
MAP, which uses the Boltzmann entropy classification of previously explored terrain. By 
having a quantifiable measurement of the exploitation aspect, the UV has the possibility 
of factoring the exploitation score into the determination of its next trajectory, in which 
the UV can determine the best trajectory for localization. Through exploitation of the 
terrain, both the map and UV position errors are reduced, accomplishing the reduction of 
positional uncertainty objective stated in Section 1.1. Furthermore, the exploitation score 
feeds into the determination of the best trajectory for the UV, whether that be exploration , 
exploitation , or a combination of the two. 
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CHAPTER 6: 
Reward Function 


Exploration and exploitation are the two factors that must be optimized to efficiently and 
effectively map an unknown area. Exploration is seeking out new areas while exploitation 
is minimizing the position of uncertainty (PUC) to enable accurate navigation. This push 
and pull between exploration and exploitation is managed through a reward function, a 
mathematical framework that ensures that exploration and exploitation occur at the proper 
time. However, exploration (spatial estimator) and exploitation (temporal estimator) have 
different units and describe different qualities with different metrics, making it difficult to 
determine the proper weighting between them. This chapter introduces the overall format 
of the reward function and discusses appropriate gains for exploration and exploitation. 

6.1 Reward Function 

The reward function is a result of the dual optimization calculation that determines the 
reward, R(x s ), of state x s as a function of three components: the exploration score, Yi; the 
exploitation score, To; and a penalty for going outside the boundaries of the desired area, 
(. Poff )• The relationship between the exploration score, exploitation score, and out of area 
penalty can be formulated into an equation, 


R(x s ) = k\T\ + kj To - Pqff, 


( 6 . 1 ) 


where k\ and ko are the respective gain values for exploration and exploitation , and Poff is 
the out-of-area penalty. The subsequent sections showcase different methods for selecting 
values for k\ and ki, specifically constant gains, adaptive heuristic gains, and optimal 
estimator gains. 

6.1.1 Constant Gain Model 

Constant gains are the easiest to model, in which the magnitude of the gain relates to the 
importance of the metric. In this case, exploration and exploitation are equally weighted, 
k\ - ko = 0.5, as both are equally important. 
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However, constant gains do not allow for exploration nor exploitation to take precedence 
over the other. In this thesis, the constant gains of k\ = ko = 0.5 set a minimum baseline 
for the interaction between exploration and exploitation. 

6.1.2 Adaptive Heuristic Gain Model 

The prioritization between exploration or exploitation can be determined through adaptive 
gains. For example, when exploration takes precedence, the exploration gain will be higher 
than that of the exploitation gain, resulting in a higher reward for paths that have higher 
exploration scores. 

The exploration heuristic adaptive gain is a function of the map area covered. As a larger 
map area becomes explored, it becomes more difficult to find unexplored areas and, in order 
to prioritize exploration , the exploration gain needs to be increased. The following function 
is used for the exploration heuristic adaptive gain: 


k\ 




< 


i-f’ 


[k 1 


,max> 


for k ] < k] max 
otherwise, 


( 6 . 2 ) 


where £ is the fraction of the map area covered. Since effective exploration cannot occur 
without localization, the explore score is capped at a maximum gain value k\ jnax (this thesis 
uses kijnax = 10) such that exploitation can still occur if both are required. 


The heuristic adaptive gain for exploitation is twofold: first, the algorithm changes to prior¬ 
itize closer localization points, and second, ko, increases without bound until localization 
occurs. First, to prioritize closer high entropy areas, the MAP estimate utilizes an additional 
weighting component, 

m„)) = N, N-1,.,.2, 1 (6.3) 

in which N is the number of discretized points along trajectory dj, to calculate the likelihood 
of localization, 


P(L\0i) = 


' Zf = i c(m,.„) w(£(m„)) P(g(m„)) 
z^li c(m,-,„) w(f(m„)) 

0 , 


if ZjLi c(m i,n) > o 

otherwise, 


(6.4) 
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in which 


c(m ;> ) 


0 , 


if m i n explored 
otherwise. 


(6.5) 


Second, the exploitation heuristic adaptive gain, k 2 , increases without bound until localiza¬ 
tion occurs and is a function of the AUV position error, 


k 2 = Sfto 2 (6.6) 

in which Sf is a scaling factor, oj is the position error and a function of the position estimate 
Xk and dead reckoning system model xdr, defined as oj = \\xu - xdr\\- For this thesis, 
Sf = 15 was used to prioritize exploitation over exploration. Since exploitation is critical 
in order to create an accurate map, the exploitation gain increases without bound such that 
no matter how large the exploration gain (maximum of 10 for this thesis), the exploitation 
score will eventually gain precedence and localization will occur. 

6.1.3 Adaptive Optimal Estimator Gain Model 

The optimal estimator gain values k\ and k 2 are derived in Appendix D as a function of 
uncertainty between the PF and the GPM, 


k\ 


crx 


~ 9 , ~ 0 

°T + 


(6.7) 


k 2 


cr: 


cr: + cr. 


V 


( 6 . 8 ) 


where <r~ is the normalized variance of the GPM and cry is the normalized variance of the 
PF. 


The exploration gain constant, k\, is dependent on the variance of the position estimator, 
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and is calculated as the variance of the distribution of the particles, 


1 N N 

*2 (6 - 9) 
i=l j =1 

in which ./V is the number of particles and x ; - xj is the distance between particles z and j. 
To normalize equation 6.9, an exponential decay model is used, 


cr. 



( 6 . 10 ) 


If the PF has a low distribution variance, the UV has a low positional uncertainty, which 
makes cry larger, increasing k\ and prioritizing exploration. If the PF distribution is large, 
then rry is small which lowers the exploration gain so that exploitation can be performed. 

The global estimator variance, crf, combines the variances of the GPM and is defined as [80] 

.. M 2 .MM 

= Xr- H + tP 2 £ - z «X z r - z yi. C 6 . 1 :i) 

;=1 (=1 j±i 

in which M 2 is the number of elements (M x M grid), err is the predictive variance per cell 
of the GPM, Zy is the mean value per cell, and Z* is the predictive mean value per cell. 
Taking into account second order stationarity, equation 6.11 simplifies to 


cr 


2 

1 


1 


M 2 



( 6 . 12 ) 


Equation 6.12 can be re-written for an L by H area as the average variance of the GPM of 
the map: 


cr 


2 

1 


1 

LH 


L H 


1=1 h =1 


(6.13) 
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To normalize the global estimator variance, the exponential decay model is used, 

of = e~< (6.14) 

As the map becomes more explored, the overall GPM variance (of) decreases, increasing 
the normalized GPM variance (of) and raising the exploitation gain constant (k 2 ). As a 
larger map area gets explored, a higher exploitation gain value is provided as exploitation 
becomes more necessary to ensure accuracy of the UV position. 


6.1.4 Out-of-Area Penalty 

The out-of-area (OOA) penalty solves two issues: (1) minimize the time the UV is outside 
the prescribed map and (2) ensure that the UV is able to map the sides and corners of an 
area without incurring the OOA penalty. The UV will not abide by the boundaries of the 
map if the penalty is too small, and the UV will not approach the boundary if the penalty is 
too large. In this thesis we propose an OOA penalty of the form, 


P OFF 



lor d < dA 
otherwise, 


(6.15) 


in which d is the distance off the map and c /4 is the allowed distance. For this thesis, c /,4 = 10 
to allow for access to the corners and edges of the map. 


6.2 Reward Conclusions 

The reward function utilizes three different methods to best determine the weighting between 
exploration and exploitation through a constant gain model, an adaptive heuristic model, 
and an optimal estimator model. These three different methods for determining exploration 
and exploitation gains allow for a comparison between each approach through a trajectory 
evaluator (discussed in Chapter 7), to determine the best way to leverage exploration and 
exploitation. 
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CHAPTER 7: 
Trajectory Planner 


Only about one percent of the ocean floor has been explored [88], leaving a massive, 
unknown search space with minimal areas to perform exploitation. In order to decide 
which trajectory is the best of the set, the trajectory planner leverages exploration and 
exploitation through the reward function, as previously discussed in Chapter 6. However, 
this problem is computationally complex, known specifically as PS PACE-complete [89], 
which makes solutions intractable. Therefore, limitations and constraints are placed on 
the UV and a partially observable Monte Carlo planning process (POMCP) — a POMDP 
coupled with a MCTS — is used to significantly reduce the computational complexity to 
make this problem solvable in real time on a computationally limited UV. 

7.1 Computational Perspective 

The world doesn’t yet possess the computational firepower to fully solve the dual optimiza¬ 
tion problem between exploration and exploitation with infinite trajectories to evaluate over 
infinite horizons, even for an area that is as small as 255m x 255m. Using vehicle sensors, 
the minimum path length can be computed using the search area and sensor size. For 
example, if the search area A is a 255m x 255m area and the sensor size S equals 60m, 
the minimum path length L for complete coverage with no sensor overlap can be calculated 
though A = SL which results in a path length of L = 1093 meters. Constraining the vehicle 
to a 50 meter path length that allow for constant heading rates changes of [-2,-1,0,1,2] 
degrees per minute, the computational complexity over the total path length reduces to 5 22 
or ss 10 15 computations solely for the trajectory planner. While there are computers that are 
on the petascale computing level [90], they are very expensive, too large for an AUV, and 
overall not feasible. 

The REMUS sensor size is around ten meters which translates to a minimum path length of 
3277 meters and a computational complexity equal to 5 66 or « 10 46 . Thus, the computational 
complexity of this problem is too large as the problem stands, but recent Machine Learning 
advances in Reinforcement Learning have provided a path to indirectly solve for the solution, 
using the combination of Decision Theory and Game Theory called a partially observable 
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Monte Carlo planning process (POMCP), which is used as the trajectory planner. 


7.2 Decision Theory 

Decision Theory combines probability theory and a utility function to calculate a reward 
value for a framework for decision sequences that involve an aspect of uncertainty [7]. The 
basic building block for Decision Theory is a Markov Decision Process (MDP), which can 
be expanded into a partially observable Markov Decision Process (POMDP) for partially 
observable states. Further, the same methodology can be performed over a finite region 
when the POMDP computational complexity becomes too large. 

7.2.1 Markov Decision Process (MDP) 

A Markov Decision Process (MDP) defines a model that can be used to solve a sequen¬ 
tial decision problem for a stochastic, fully observable environment that consists of four 
components [86]: 

• A set of states, s e S with initial state .vq. 

• A set of actions, a e A, for each state. 

• A transition state model, P(v ’\s,a), that calculates the probability of reaching state 5 ’ 
if action a is applied to state 5. 

• A reward function, R(s ) = E[r t+ i\s t = s,a t = a] [6]. 

The solution, known as a policy, tt,-, consists of an action sequence that the agent should 
employ from the initial state. The Optimal Policy, n* , is the policy that produces the highest 
expected reward. 

For example, consider the problem from [86], as shown in Figure 7.1, in which the agent 
starts at an initial state, .vq, with the goal of reaching either terminal states (+1,-1), with 
the (+1) terminal state being preferred. The stochastic environment may lead to a different 
policy, 7T;(a), for each iteration of the planning process that may or may not achieve the 
terminal state. After the planned policy is generated, the policy is evaluated based on the 
reward of the policy: the combination of the utility and probability of reaching the desired 
state. 
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Figure 7.1. (a) Simplified Environment for Sequential Decision Problem with 
terminal states +1 and -1 and all other states yielding a reward of -0.04, 
with exception of dark square which is an obstacle, (b) Transition Model 
Illustration: 0.8 probability of intended movement, 0.2 probability agent 
moves in unintended direction. Source: [86]. 


The optimal policy, n*(a), would be the policy with the best expected reward, for example 
a = [UP, UP, RIGHT, RIGHT, RIGHT ] [86]. The utility of the optimal policy is 
the sum of the individual utility of each cell in the path, calculated as Uj = -0.04 x 
5 + 1 = 0.8. The -0.04 reward for each state gives the agent an incentive to move, as 
opposed to a +0.04 reward where the agent stays without the risk of being penalized and 
accumulate utility. Each action of the optimal policy, n*(a), has a success probability 
of 0.8 of moving correctly, P$ = 0.8 5 . There is also a possibility that the agent plans 
to follow the optimal policy and accidentally reaches the terminal state, for example a = 
[ RIGHT, RIGHT, UP, UP, RIGHT], which gives a failure probability of P F = 0.1 4 . 
The total probability of success in this instance would be the combination of the success 
and failure, Pj = 0.8 5 + 0.1 4 = 0.32776. The other probabilities of failure are not possible 
as either they go through the (- 1 ) state, or take more than five moves to achieve the terminal 
(+ 1 ) state. The reward can then be calculated as R = UjPt - (0.8)(0.32776) = 0.262208. 

MDPs work well in fully observable environments where the utility of each state is known 
and the actions are probabilistically reliable. However, as the problem referenced in Figure 
7.1 grows in size, the computational requirements to determine the optimal policy increase 
exponentially. Further, in most environments, the actions will not be as reliable, and the 
states for each action also may not be known. Mapping an underwater area of 255m x 255m 
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meters where the initial, interim, and terminal states as well as their respective utilities and 
probabilities are unknown makes the MDP incapable of determining the optimal solution, 
as the problem is no longer fully observable. 

7.2.2 Partially Observable Markov Decision Process (POMDP) 

Relaxing the fully observable assumption, the agent no longer knows its present state and 
the problem shifts to a partially observable environment and requires a partially observable 
Markov Decision Process (POMDP) to develop a solution. The POMDP still utili z es 
the same basic elements of the MDP—a set of states s 6 S, a set of actions a e A, a 
transition model P(s'|s, a), and a reward function R(s)—while adding an observation model 
P(o|s) to calculate the probability of an observation o at state s [ 6 ], [ 86 ]. The observation 
model consists of the observation, o e G, which is then used to calculate the observation 
probabilities P(o|s) [ 6 ]. 

The history, h, = [ai, o\, ...a f , o t \ is the sequence of actions coupled with their respective state 
observations [ 6 ], as opposed to the MDP use of the utility of that state. The policy, 7r(h,a) 
correlates the history to the probability distribution of the actions. The reward is the total 
reward accumulated over the policy, R t = 'L™ =t y n ~ t r n , [ 6 ] where y is the discount factor and 
r n is the reward for each state of policy length n. The value function V' 7r (h) = E n [R r |h, = h] 
is the expected reward from state s while performing policy n. From the value function, the 
optimal value function is the maximum value function, V*(h) = max n V n ( h) [ 6 ]. 

For every POMDP, there exists at least one optimal value function, V*(h), and corresponding 
policy, 7r*(h,a), that gives the best solution [ 6 ]. 

7.2.3 Finite-Horizon Partially Observable Markov Decision Process 
(POMDP) 

The computational power required to compute the POMDP for all states to determine the best 
policy is not always available. In the case of the computationally limited REMUS vehicle, 
the POMDP requires an adjustment so that the REMUS can perform path planning in real 
time to map a 255m x 255m area. The Finite-Horizon POMDP limits the computational 
complexity of POMDP by only looking over a finite horizon, and the optimal solution can 
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be computed with the following equations [4]: 


<2r(b,a) = p(b,a) + y J i/(z'|b,a)V;i 1 T(b,a,z')dz' (7.1) 

z'eZ 

V;(b) = supQ t (b,a) (7.2) 

aeA 

7r*(b) = c/rg sup Q t ( b,a), (7.3) 

aeA 

where <2?(b,a) gives the total expected reward for t discrete periods when action a is 
performed at belief state b as a function of the initial condition, p(b, a) at t = 1 ; discount 
factor y e [ 0,1 ]; prior probability of observing i! given belief state b and action a, i](z' | b, a); 
and the Bayes Rule update r(b,a,z') [4]. Equations 7.2 and 7.3 show the smallest element 
of a 6 A that maximizes Q t . 


7.3 Game Theory 

Game Theory, an extension of Decision Theory, accounts for the interactions between mul¬ 
tiple agents [7], and has recently been used in computer-based versions of large-branching, 
turn-based games such as Go, Chess, and Checkers. For a coverage mission objective using 
ATAN with the REMUS vehicle, exploration and exploitation can be leveraged to search 
beyond the finite horizon discussed in Section 7.2.3 by using a MCTS and then coupling 
the MCTS, resulting in the overall methodology of a POMCP. 

7.3.1 Monte Carlo Tree Search (MCTS) 

Intractable integrals like equation 7.1 are approximated through a Monte Carlo tree search 
(MCTS) [7] that heuristically searches the integral decision space until either a solution is 
known or a search limit has been reached (e.g., temporal or computational). The MCTS has 
two fundamental properties: the true value of an action can be estimated through a random 
search iteration and subsequently be used to bias the policy and result in a best-first search 
strategy [7]. 


89 



Monte Carlo Tree Search (MCTS) Algorithm 

The MCTS is used to build and search a decision tree by sequentially selecting the best 
branches. The decision tree (shown in Figure 7.2) is anchored by a root node vq at state so, 
that contains the reward value for the node, Q(s), and a visit count for the node, N(s,a), 
that is connected to its child nodes based on actions a e A. Those child nodes become the 
parent node of new child nodes related by the same actions a e A. 

Each node is initialized to Q(s ) = 0 and N(s, a) = 0 and each MCTS iteration starts at the 
root node. Each iteration consists of three policies— tree, default, and backpropagation— 
which interact using Algorithm 4 shown in Figure 7.2. 



\ 

\ 

\ 

\ 

□ 


Algorithm 4: MCTS [7] 

1 function MCTS (so) 

2 create root node vo with state so 

3 while Within computational budget 

do 

4 vi <- TREEPOLICY(uo) 

s A <- DEFAULTPOLICY(s(tq)) 

6 BACKPROPAGATION(u/,A) 

? return a (BESTCHILD^o, 0)) 

8 end 


(a) 


(b) 


Figure 7.2. (a) MCTS Visual Description of the Decision Tree. Source: [7], 
(b) MCTS Algorithm. Source: [7], 


The relationships between the tree policy, default policy, and backpropagation are illustrated 
in Algorithm 4 and Figure 7.3. The tree policy selects and expands the search tree, the 
default policy produces a reward estimate of one random iteration of the tree search, and 
backpropagation adds the rewards of descendent children to the parent node. 
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In selection, the tree policy recursively chooses the best branch at each node tree through 
the upper confidence bound for trees (UCT) defined as [6], [7] 


Q-(s,a)=Q(s,a) + cJ^ y (7.4) 

As c —> 0, the UCT algorithm acts more greedily. Once every action from state s has been 
evaluated, the algorithm selects the action maximizing Q*. 

In expansion, the search is no longer bounded by the tree, and new branches need to be 
formed. More child nodes are added to the tree based on the actions available, a e A, by 
uniformly picking the subsequent action (modeled as a uniform random variable) [6]. In 
simulation, each node is evaluated based on the default policy. In backpropagation, the 
evaluation result is backtracked to the root node, where it updates the value Q(s) and the 
visit count N(s, a) for each node visited. 

The search terminates when either the search is complete or the computational budget is 
reached. At this point, the next action is determined using one of the following criteria: 
max child (highest value), robust child (highest visit count), max-robust child (both highest 
value and visit count), or secure child (maximize lower confidence bound) [7]. 

The search tree then conducts a pruning operation, removing all nodes not descendent from 
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the selected action child node, and the MCTS process commences again. 

Note there are two places where biasing can affect the outcome of the MCTS search — 
child selection policy and action selection criteria. 

Improving MCTS: Child Selection Policy 

The next action the AUV executes is determined through the MCTS child selection policy. 
There are four standard options [7]: 

• maximum reward child 

• most visited child 

• highest reward, most visited child 

• highest low confidence child. 

In this work we utilize the highest reward, most visited child with a secondary option on 
maximum reward child. As the MCTS will be completed as the AUV traverses a trajectory, 
the AUV requires a new command at the end of its current trajectory prior to the beginning 
of the new trajectory. If the highest reward, most visited child is not a viable option, the 
AUV will use the maximum reward child since there is no additional time for more iterations 
of the MCTS. The secondary option gives the AUV the best child through the maximum 
reward value, represented by the accumulation of the exploration and exploitation scores 
over the MCTS. 

Improving MCTS: Biasing 

Biasing the MCTS affects the action selection criterion by changing the c value in the 
MCTS equation 7.4. As the c value increases, the MCTS selects new areas to explore, 
weighting exploration of the tree more heavily. As the c value decreases, the MCTS acts 
more greedily, and instead chooses the next child based on which gives the highest reward 
(exploitation). 

Previous work selected c - 1 /V2 to satisfy the ffoefFding inequality for a reward within 
[0,1] [7], which also is in line with the scores of exploration and exploitation. 

In this thesis we explore an avenue to improve the results of the MCTS by using an adaptive 
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c value, specifically, 


c = 



(7.5) 


where k\ is the gain value for exploration , ko is the gain value for exploitation , and // is a 
constant. By making the c value proportional to the ratio of exploration to exploitation , the 
MCTS parallels the desire of the AUV for exploration or exploitation. 


7.4 Partially Observable Monte Carlo Planning 

The combination of the POMDP and MCTS yield the POMCP, as depicted in Figure 
7.4. The search horizon, illustrated with the red, yellow, green, and pink trajectories, 
depicts the POMDP consisting of four trajectories and the planning horizon portrayed in 
blue represents the MCTS equal to nine trajectories. The large decision space up to the 
planning horizon leads to a large computational complexity, with each additional search 
horizon exponentially adding more computational cost. However, the MCTS reduces the 
computational complexity to within the computational limit by advantageously searching 
the large decision space starting at the root node (AUV position) and terminating at the 
planning horizon through the child selection criterion and biasing. When one iteration of 
the search is complete, the reward for each individual trajectory is compiled using a discount 
factor, y, to ensure that the cumulative reward is more dependent on closer trajectories. 


93 



200 meter POMDP, 250 meter MCTS 
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Figure 7.4. POMCP Visual Description of POMDP trajectories (red, yellow, 
green, pink) and one MCTS trajectory (teal). 



Using a POMCP, a combination of the POMDP and MCTS, allows for a large decision 
space to be advantageously searched within a computational limit, and permits the planning 
process to look over a larger distance. With longer possible paths to evaluate, the AUV has 
the ability to better position itself to explore or exploit the environment and make a better 
decision for the current move based on potential future moves. 

7.5 Conclusions 

The POMCP allows for the determination of a viable solution within a large decision 
space that is formed while conducting trajectory planning by significantly lowering the 
computational complexity. By selectively biasing the POMCP, the trajectory planner 
becomes better at finding the best solutions in a more expedited timeframe, a concept that 
can be better portrayed by results shown in the next chapter. 
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CHAPTER 8: 
Results and Evaluation 


To accomplish autonomous navigation, the AUV must primarily be able to navigate without 
external localization sources or an a priori map. Additional secondary objectives include 
minimizing total positional error, completely covering a defined region with a designated 
bathymetric sensor, creating an accurate bathymetric map, and finishing the task within 
time and energy constraints. The primary and secondary objectives can be accomplished 
through the use of a position estimator and leveraging exploration and exploitation through 
a mathematical framework called a reward function. The reward function is then employed 
through a POMCP, which scales down the computational complexity while biasing the 
search to focus on the most advantageous areas. This chapter establishes a baseline metric 
for a coverage problem, compares the three reward functions discussed in Chapter 6, and 
evaluates the trajectory planner using results of the POMDP, unbiased POMCP, and the 
biased POMCP. 


8.1 Autonomous Underwater Vehicle Specifications 

The AUV specifications used to compile the results are listed in Table 8.1. 


Table 8.1. AUV Model Parameters. 


Heading Rates 

[-2,-§,-§,0,§,§,2 ]°/s 

POMDP Horizon 

3 trajectories 

MCTS Horizon 

10 trajectories 

AUV INS Error 

0.5% distance traveled 

AUV speed 

3.5 knots 

Trajectory length 

50 meters 

AUV Allowed Error 

1.5 m 

Number of Particles 

1000 
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8.2 Benchmark 

There must be metrics in place on which to judge the trajectory planner prior to discussing 
the results of the trajectory planner. This section discusses the appropriate benchmark 
through calculating the minimum time, discussing common mapping techniques and their 
respective abilities, and using more computationally complex methods with a perfect map 
to determine the best solution. 

8.2.1 Optimal Solution 

The best coverage path length can be determined with the equation A = SL, where S is the 
sensor coverage size, L is the number of discretized points over the entire path length, and 
A is the search area. The sensor size can be represented as two intersecting circles. As the 
AUV moves from one discretized step to the next on a trajectory, there will be some margin 
of sensor overlap. Therefore, the sensor size can be represented as the change in area of two 
intersecting circles [91], 

S = nR 2 - 2fl 2 cos -1 (;4) + \d^AR 2 - d 2 , (8.1) 

where R is the radius of the circle, and d is distance between the center of each circle. 
The overall best time to search an area can be determined using the speed of the AUV and 
number of discretized points L. In this case, the optimal time is ~ 17 min with a sensor 
radius of 10m and a speed of 3.5 knots. The optimal time is most likely not achievable 
based on vehicle motion dynamics and further does not allow the AUV to localize as no 
area of the map is re-visited. However, it is still a useful performance metric for the AUV 
trajectory planner as the ideal solution without localization. 

8.2.2 Common Techniques 

Three prevalent techniques to perform mapping of an area exist: box in mapping method¬ 
ology (BI), spiral out mapping methodology (SO), and mow-the-lawn mapping method¬ 
ology (MtL). Each method was performed with an AUV with identical speed and sensor 
specifications, specifically 3.5 knots and a 10 meter sonar sensor range, that performed 
the mapping of the 255m x 255m area with no sensor overlap of previously mapped areas 
and another with half sensor overlap of previously mapped areas to provide a change for 
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localization, as shown in Figure 8.1. 




Longitude, [degrees] 



Longitude, [degrees] 


(e) (f) 

Figure 8.1. Common Mapping Techniques: (ab) Bl, (cd) SO, and (ef) MtL. 
(ace) No Sensor Overlap (bdf) 50% Sensor Overlap. 
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The map area gets fully covered by each mapping technique, as illustrated in Figure 8.1. 
The left column of Figure 8.1 shows no sensor overlap, the quickest way to cover the entire 
area, whereas the right column shows a 50 % sensor overlap, an attempt to provide a balance 
between exploration and exploitation. The coverage at each time in the trajectory, along 
with the AUV position error is shown in Figure 8.2. 


COMPARISON OF COVERAGE - COMMON METHODS 



(a) 


COMPARISON AUV POSITION ERROR 



(b) 

Figure 8.2. Common Mapping Techniques Comparison in (a) Coverage and 
(b) AUV Position Error. 
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Looking at Figure 8.2 (a), the optimal solution (discussed in Section 8.2.1) maintains the 
best time to cover the specified area. The SO method with no overlap provides the best 
initial coverage by always exploring new regions and keeping the optimal solution until the 
AUV starts going over areas that are not required to be mapped. The BI and MtL methods 
with no desired sensor overlap are very similar, as each has minimal sensor overlaps when 
moving from one trajectory to another. However, there is no ability for the AUV to localize 
with no overlap, leading to higher AUV position error. Therefore, 50% sensor coverage 
trajectories were implemented to give the ability for localization, with SO performing the 
best for coverage, followed by BI and MtL, respectively. The BI method performs better in 
coverage than MtL as BI utilizes a full sensor over unexplored space for a longer duration 
prior to starting the 50% sensor coverage routine (AUV goes around the map fully in BI as 
opposed to back-and-forth in MtL). 

Figure 8.2 (b) depicts the AUV position error as a function of time, in which only 50% 
sensor overlap methodologies are included. The 100 % sensor coverage over new areas 
does not provide the ability to localize position, and therefore the error increases linearly as 
a function of distance. Of the three 50 % sensor coverage methods, BI and MtL are similar, 
with SO not performing as well. However, for these cases, the UV error is a function of 
the map terrain. Since the UV has a predetermined path, the UV cannot exploit areas that 
it already knows, and must get "lucky" for localization to occur. Regardless, results show 
AUV position errors obtained for the three 50% sensor coverage methods are above the 
1.5m specification indicated in Table 8.1, showing that predetermined paths are not a viable 
methodology. 

Difficulties with Common Techniques 

The difficulties presented with the common techniques are two-fold: maneuverability of the 
AUV and map dependency. 

First, the AUV cannot pivot ninety degrees instantaneously, requiring at a minimum, a 
rudder angle based arc to maneuver which leads to the inability of the AUV to follow the BI 
or MtL trajectories. Further, the shape of the search area adds an additional hindrance to the 
AUV, providing corners and straight edges that the AUV doesn’t have the maneuverability 
to properly handle. The only feasible trajectory, SO, covers areas outside the desired region, 
and once the AUV gets to the outermost edge of the spiral, the time to fully spiral around 
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grows significantly, making it a less desirable approach for larger areas. 

In the 50% sensor coverage case, the AUV has the ability to localize using previously 
explored areas of the map. However, by performing a predetermined pattern of movement, 
the AUV is unable to find and exploit the environment and is therefore dependent on the 
map features. 

8.2.3 Brute Force Results 

The computational requirement to solve this problem, even with the restrictions listed 
in Table 1.1, leads to an evaluation of ~ 10 46 trajectories, as shown in Section 7.1. The 
immense computational cost to determine the optimal solution was too large using resources 
available and instead we employed an alternative to determine an optimal trajectory within 
computational requirements of an intel XEON processor. Instead of determining the optimal 
solution, a six-horizon POMDP—more horizons than computationally feasible in real time 
for the REMUS vehicle—coupled with a perfect map was used to determine an optimal 
trajectory within computational limits. For comparison, one iteration of the six-horizon 
POMDP was completed in 32.2 min while an AUV is able to traverse a 50m trajectory in 

32.3 seconds. The six-horizon POMDP was run using all three reward functions starting at 
the same location, with the optimal estimator reward function providing the best coverage 
and AUV position error. 

8.2.4 Establishing the Benchmark 

Results obtained with the methods previously described are compiled in Figure 8.3 and 
Table 8.2 and are compared to evaluate an appropriate benchmark. 

Figure 8.3 depicts the coverage of the six-horizon POMDP with a perfect prior map, SO 50% 
sensor overlap, BI 50% sensor overlap, and MtF 50% sensor overlap, with the six-horizon 
POMDP with a perfect prior map performing the best from a coverage standpoint. 
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COMPARISON OF COVERAGE - COMMON METHODS 



— OPTIMAL COVERAGE 

-100 % COVERAGE LINE 

O PERFECT MAP POMDP 

SPIRAL OUT HALF OVERLAP 
-*— BOX IN HALF OVERLAP 
-* — MOW LAWN HALF OVERLAP 


40 

time (min) 


Figure 8.3. Comparison of Coverage Techniques between the six-horizon 
POMDP with perfect map, box in (Bl), spiral out (SO), and mowing the 
lawn (MOW) methods with 50% sensor overlap. 


Referring to Figure 8.3, the optimal solution and six-horizon POMDP with a perfect map 
were chosen as the appropriate benchmarks for coverage comparison. While the optimal 
solution is likely not feasible, it provides the best solution in terms of coverage based on 
AUV parameters. The six-horizon POMDP was also selected as it led to the best solution 
among all techniques investigated that allow for localization. 
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Table 8.2. Comparison of AUV Positional Error Using Coverage Techniques 
listed in Figure 8.3. 


AUV Error (meters) 


Method 

Avg 

Max 

POMDP 

0.19 

1.2 

SO 

- 

>10* 

SO 50% 

3.5 

>10* 

BI 

- 

>10* 

BI 50% 

1.8 

7.6 * 

MtF 

- 

>10* 

MtF 50% 

1.5 

6.9 * 


Table 8.2 shows the AUV position error for each method used in Figure 8.3. Results show 
the six-horizon POMDP gives the most realistic answer in terms of coverage and AUV 
position error, and will be used as the benchmark for this thesis. 

8.3 Reward Function Determination 

This Section discusses the three different reward functions presented in this thesis and shows 
results of each reward function using the POMDP and the unbiased POMCP. 

8.3.1 POMDP 

Ten simulations of the POMDP for each reward function was implemented using random 
starting points. The coverage and AUV error obtained for ten simulations was recorded to 
first evaluate the coverage and AUV position error. Shown in Figure 8.4 and Table 8.3, the 
mean and 95% confidence interval (Cl) were determined for the coverage and AUV error 
for each simulation, and the maximum AUV error was also recorded. 
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COVERAGE- POMDP 
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-OPTIMAL COVERAGE 

O POMDP PERFECT MAP 

- 100 % COVERAGE LINE 

NO MCTS -- CG -- 95% Cl 
NO MCTS - CG - 95% MEAN 
NO MCTS -- HG - 95% Cl 
NO MCTS - HG - MEAN 
NO MCTS - OE - 95% Cl 
NO MCTS - OE - MEAN 


100 


30 40 

time (min) 


Figure 8.4. POMDP Results for Coverage of Three Reward Functions Pre¬ 
sented in Chapter 6: CG - Constant Gain (Section 6.1.1), HG - Heuristic 
Gain (Section 6.1.2), and OE - Optimal Estimator (Section 6.1.3). 


From a coverage standpoint results show the heuristic gain (HG) and the optimal estimator 
(OE) have similar performance (OE performing slightly better) and both methods surpass 
constant gain coverage as illustrated in Figure 8.4. Further, the POMDP coverage results 
plateau at around 90% coverage, struggling to find new areas to explore from 30 min to 
80 min due to the inability to look further ahead toward trajectories that achieve complete 
coverage. 


103 









Table 8.3. POMDP Results AUV Position Error of Three Reward Functions 
Presented in Chapter 6: CG - Constant Gain (Section 6.1.1), HG - Heuristic 
Gain (Section 6.1.2), and OE - Optimal Estimator (Section 6.1.3). 


POMDP AUV Error (meters) 


Gain Type 

Avg 

95 % 

Cl 

Max 

CG 

0.25 

0.46 

1.47 

HG 

0.22 

0.34 

1.40 

OE 

0.19 

0.29 

1.14 


All three reward functions led to a maximum position error below the required 1.5m 
maximum. However, the average, 95 % Cl, and maximum position error between the three 
different reward functions, as shown in Table 8.3, display that the OE gain type leads to the 
lowest average, smallest 95 % Cl, and lowest maximum. 

8.3.2 POMCP 

Ten simulations of the POMCP were then run from random poses to determine coverage 
and AUV position error, as shown in Figure 8.5 and Table 8.4, respectively. 
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COVERAGE - UNBIASED POMCP 



^ooooooooooooo 

o ° ° ° 


-OPTIMAL COVERAGE 

O POMDP PERFECT MAP 

- 100 % COVERAGE LINE 

MCTS -- CG -- 95% Cl 
MCTS — CG -MEAN 
MCTS - HG - 95% Cl 
MCTS- HG -MEAN 
MCTS - OE - 95% Cl 
MCTS - OE - MEAN 


30 40 

time (min) 


50 


60 


70 


Figure 8.5. Unbiased POMCP Results for Coverage of Three Reward Func¬ 
tions Presented in Chapter 6: CG - Constant Gain (Section 6.1.1), HG - 
Heuristic Gain (Section 6.1.2), and OE - Optimal Estimator (Section 6.1.3). 


Results show HG provided the best coverage, with OE very close behind. However, the 
map area coverage still only reaches a maximum of around 93 % at around 80 min of the 
AUV search. 
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Table 8.4. Unbiased POMCP Results for AUV Position Error of Three Re¬ 
ward Functions Presented in Chapter 6: CG - Constant Gain (Section 6.1.1), 
HG - Heuristic Gain (Section 6.1.2), and OE - Optimal Estimator (Section 
6.1.3). 


Unbiased POMCP AUV Error (meters) 


Gain Type 

Avg 

95% 

Cl 

Max 

CG 

0.20 

0.32 

1.46 

HG 

0.22 

0.35 

1.61 

OE 

0.18 

0.31 

1.19 


The main difference between results obtained with the HG and OE gain values comes down 
to the AUV position error, shown in Table 8.4. The OE maintains significantly lower AUV 
position error while also not exceeding the maximum specified error of 1.5 meters. 

Overall, the best method for the reward function determination is through optimal estimator 
based on the coverage and AUV position error of the three reward functions within the 
POMCP framework,. 

8.3.3 Reward Function Conclusions 

Three different reward functions were evaluated through three different scenarios: 

• a priori perfect map, six-horizon POMDP (Section 8.2.3) 

• unknown map, three trajectory POMDP (Section 8.3.1) 

• the unknown map, POMCP utilizing three trajectory POMDP and ten trajectory 
MCTS (Section 8.3.2). 

Results showed the OE to be the best overall reward function in each of the three scenarios 
based on the coverage and AUV performance. 
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8.4 Partially Observable Monte Carlo Planning 


It is also important to determine whether the MCTS is worth the additional computational 
load by comparing unbiased POMDP and unbiased POMCP results. The coverage compar¬ 
ison for both the unbiased POMDP and unbiased POMCP utilizing all reward functions is 
shown in Figure 8.6. 


Comparison of Coverage 
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Figure 8.6. Comparison of POMDP and POMCP Coverage Performance. 

As seen in Figure 8.6, the addition of the MCTS allows the POMCP to attain better 
coverage than that without the MCTS. Furthermore, the addition of the MCTS also lowered 
or maintained the AUV position mean error throughout the entire trajectory. 

Overall, the MCTS only adds a computational cost as opposed to a temporal cost since 
the computational cost does not exceed the time required for the AUV to traverse a 50m 
trajectory. The MCTS only searches using the time remaining prior to the next decision, 
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and therefore is an improvement over a non-MCTS approach. 


8.5 Biased Partially Observable Monte Carlo Planning 

The MCTS balances exploration and exploitation through a constant c [6], 


Q*(s,a) = Q(s,a ) + c 


InN(s) 
N(s,a ) 


( 8 . 2 ) 


By increasing the c value, more exploration of the MCTS occurs. By decreasing the c 
value, more exploitation of known areas of the current search occur. Using ten scenarios 
starting from random poses and the optimal estimator (OE) method for the reward function, 
this thesis proposes a MCTS biasing option through a ratio of exploration and exploitation 
gain values, 



(8.3) 


Equation 7.5 is evaluated for each of the following rj values, rj - [-5,-1,0.5,1,2,3,4,10], 
in which the coverage and AUV position errors were recorded. For clarity, only rj values of 
rj - [-5,0.5,1,10] were plotted in Figure 8.7. 
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Biased POMCP 
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Figure 8.7. Biased MCTS Mean and 95% Cl Coverage Results for ^ = 
[-5,0.5,1,10], 


Table 8.5. Biased POMCP Results for AUV Position Error. 


Biased POMCP AUV Error (meters) 


V 

Avg 

0 

l-H la 

eR 

Max 

0.5 

0.22 

0.33 

1.76 

1 

0.23 

0.31 

1.45 

10 

0.33 

0.47 

2.52 

-5 

0.38 

0.57 

2.96 
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Referring to Figure 8.7 and Table 8.5, the best coverage and AUV position error occur for 
T] = 1, closely following the six-horizon, perfect map POMDP in coverage and maintaining 
a slightly higher average and maximum AUV position error (POMDP AUV position mean 
0.19 and average 1.2), but still within the overall specification of 1.5 meters. 

All tj values, r] = [-5,-1,0.5,1,2,3,4,10], were then used for a basis of comparison for 
coverage and AUV position error, as shown in Figure 8.8. 
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COVERAGE COMPARISON 



(a) 


AUV POSITION ERROR COMPARISON 



(b) 

Figure 8 . 8 . MCTS Bias 77 -Value Comparison, (a) Mean and 95% Cl Cov¬ 
erage for 77 = [-5,-1,0.5,1,2,3,4,10]. (b) Mean, 95% Cl, and maximum 
AUV Position error for 77 = [-5,-1,0.5,1,2,3,4,10]. 


Referring to Figure 8 . 8 , it is noted that as the 77 values get further from 77 = 1, the coverage 
95 % Cl, the AUV position error 95% Cl width, and the maximum value AUV position 
error, all increase. Therefore, 77 = 1 was chosen as the best biasing option for coverage and 


111 



















AUV position error considerations. 


8.6 Best Trajectory from Trajectory Planner 

The POMCP was run for ij = 1 using the OE reward function, achieving full coverage 
at 7250 meters, or approximately one hour. The step-by-step trajectory is provided in 
Appendix E, with the final trajectory and the AUV generated map error shown in Figure 
8.9. 



Figure 8.9. (a) Biased POMCP Result Trajectory and (b) AUV Map Error 
comparison to CSUMB Map [78]. 


The overall trajectory resulted in a maximum AUV position error of 1.47 meters, mean 
AUV error of 0.22 meters, with a 95 % Cl of 0.33 meters, narrowly achieving the 1.5 
meter error specification. Furthermore, the total map error, shown in Figure 8.9(b), has a 
maximum error of 0.08 meters, which is within the specified sensor error. 
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CHAPTER 9: 

Conclusions and Future Work 


The requirement for accurate bathymetric charts for the open oceans, ports, and channels 
has never been more apparent. The oceans cover seventy-one percent of the surface of 
the Earth [92], provide a means of transportation for ninety percent of world goods and 
resources [93], and play a part in approximately twenty one trillion U.S. dollars per year to 
human welfare (sixty percent from coastal and shelf systems and forty percent from open 
oceans) [92]. This thesis proposes ATAN within the underwater domain as a solution to 
quickly, efficiently, and accurately map areas while not requiring an a priori map or external 
localization sources. 


9.1 Thesis Conclusions 

This thesis demonstrates the ability of an AUV to navigate autonomously without external 
localization sources or an a priori map while also fulfilling the following objectives: 

• minimizing total positional error 

• completely covering a specified area with a designated sensor 

• creating an accurate map within certain AUV position error 

• completing the mapping problem within time and energy constraints. 

Results show that (1) the AUV total positional error was minimized to under 1.5 meters; (2) 
the full area was accurately covered, having a maximum error of 0.08 meters when compared 
to the real map; and (3) the AUV was able to map the entire area in approximately one hour, 
outperforming larger computational cost methods. 

9.2 Thesis Contributions 

This thesis demonstrated the viability of ATAN for autonomous vehicles and provides the 
following contributions to the area: 

• A real-time information-theoretic trajectory planner that adaptively leverages explo¬ 
ration and exploitation through a POMCP 
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• A real-time implementation of the GPM that maximizes the information gain of the 
selected trajectory 

• A real-time mapping solution that does not require a prior map or external localization 
systems 

• An analytical framework that allows for inspection and evaluation of different ap¬ 
proaches within ATAN. 

The autonomy of a UV can be measured through three constructs: energy, position esti¬ 
mation, and the ability to make decisions [94]. ATAN coupled with the REMUS system 
provides a capable energy source, accurate position estimation, and the ability to inter¬ 
pret, interact, and decide trajectories in new environments. Through this definition, ATAN 
emphasizes greater levels of autonomy for UVs and improves upon the foundation and 
application of TAN. 

9.3 Recommendations for Future Work 

This thesis covers multiple subject areas. Therefore, considerations for future work are 
broken up into sections parallel to the chapters of this thesis and also includes a section on 
computational complexity of this problem and obtaining in-water results. 

9.3.1 Position Estimator 

The position estimator uses a particle filter since ATAN is a non-linear, non-Gaussian 
problem. The particle filter characteristics can be potentially improved by refining its 
parameters. 

The particle filter was modeled using a circular sonar sensor as opposed to a line sensor that 
replicates the BlueView MB 2250. Using a line sensor instead of a circular sensor makes 
the particle filter correlation method a much harder problem, requiring a more accurate 
initialization and update process. This thesis attempted to fulfill the line sensor model; 
however, we were unable to implement it with reasonable accuracy. Results showed the 
initialization and first 1000 meters was fairly consistent with position estimation (error 
under three meters). However, the PF was unable to recover once the PF converged to an 
incorrect solution. Future work within the linear sonar sensor could include the modeling 
of the linear sonar sensor using a ray trace algorithm coupled with the INS system to better 
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correlate terrain while minimizing the probability of a false PF convergence, using [29] as 
a starting point. 

This thesis considered Gaussian probability distributions within the modeling of the particle 
filter, yet the particle filter is capable of handling non-Gaussian, multi-modal functions. 
Utilizing different types of probability distributions, or different selection criteria for the 
PF modes could increase PF performance. 

For different terrain classifications (e.g., flat, rough, soft, and hard terrain), an adaptive 
particle filter approach may lead to better position estimation. Using a combination of 
adaptive probability distributions with the terrain coupled with [31] may improve the overall 
PF performance. 

Lastly, there are numerous types of particle filters. In this thesis we considered the MCL 
variety, but a comparative analysis between different particle filters like the MCMC method 
or Rao-Blackwellized PF from a computational cost standpoint and positional estimate 
accuracy would help decide the optimal position estimation techniques to be used within 
ATAN. 

9.3.2 Exploration 

The exploration component of ATAN comprises of a GPM and a trajectory evaluation com¬ 
ponent that uses the GPM and KLD to determine an exploration score. Future improvements 
on the exploration component include different modeling of the semi-variance, utilizing dif¬ 
ferent covariance matrix functions, and clustering and combining areas of l ik eness together 
into separate GPMs. 

The semi-variance can be represented using an epi-spline which can then be optimized 
using individual sections [95], which may lead to a more accurate GPM. 

In this thesis we selected the Matern function to calculate the covariance matrix based on 
its spatial properties [81]. However, there are various covariance matrices that can be used 
which may be more effective for the ATAN problem. 

Similar terrain can be modeled in a similar fashion. Using a different GPM for areas may 
lower computational cost and also increase the overall spatial estimation that occurs within 
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the GPM. By modeling smaller areas as separate GPMs, the computational complexity 
decreases. Further, with different models, each model can be refined to its specific terrain 
type, possibly making it more accurate. A possible approach to combining GPMs is through 
a Bayesian Committee Machine (BCM) [41]. Using the combination of multiple smaller 
GPMs would reduce the overall computational complexity and potentially result in a more 
accurate GPM. 

9.3.3 Exploitation 

Exploitation uses a MAP estimate to evaluate trajectories using the best hypothesis for 
position estimation to determine localization through terrain evaluation. This thesis uses 
a CDF function based on entropy values of the terrain to determine localization probabil¬ 
ity. Refining this model to a linear or non-linear function that quantifies the localization 
probability as a function of entropy using actual terrain data would potentially increase the 
accuracy of ATAN within all environments. 

Additionally, the terrain evaluation through the exploitation phase can be broken up into 
two different methodologies. In this thesis we considered a distance approach, sampling 
the terrain at constant distances over the trajectory to determine the best trajectory for 
localization. Another methodology is using cell decomposition and using every cell the 
trajectory passes over to determine the exploitation score which may yield more accurate 
results. 

Lastly, in this work, we utilized a MAP estimate for trajectory evaluation instead of a MLE 
which accounts for uncertainty of the vehicle position. Using a stochastic system to quantify 
the error associated with the Boltzmann entropy through the use of a GPM could give higher 
fidelity on exploitation planning. However, the computational complexity of the problem 
will increase dramatically by adding an additional GPM. 

9.3.4 Trajectory Planner 

The trajectory planner utilizes a POMCP to generate and evaluate trajectories. While this 
approach was shown to be successful, there are multiple ways to increase the ability of the 
UV to complete the coverage problem in a faster time and with lower vehicle error. 
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First, ATAN can be improved with inspection planning in the terminal stages. When the 
coverage reaches a certain limit, the POMCP can switch over to an inspection planning 
algorithm to selectively cover the remaining areas. 

Additionally, in this thesis we used a constant trajectory of fifty meters and constant heading 
rate changes throughout the entire problem. Another improvement of the POMCP can be 
the biasing of the trajectory length and heading rates as a function of exploration and 
exploitation , or the optimization of trajectory lengths and heading rates as a function of 
exploration and exploitation. 

Furthermore, the POMCP utilizes heading rates instead of the preferred straight line paths 
for conducting mapping. The POMCP can be easily adapted to evaluate trajectories that 
perform quick maneuvers followed by straight paths to optimize the sensor aspect. The 
harder development would be the implementation of a sensor system that can perform 
mapping while the sensor is turning. 

Lastly, in this thesis we used the maximum reward, most visited child as the selection 
criterion for the MCTS. The ability to quickly cover an area while minimizing AUV 
position error may improve if an adaptive approach to selection is used instead of a fixed 
child. 

9.3.5 Computational Complexity 

In this thesis we utilized multiple platforms ranging from Intel i9 and Intel ilO processors to 
Amazon Web Services (AWS) Elastic Computer Cloud (EC2) nodes in order to compile the 
data used for analysis. After multiple runs on the Intel i9 processor, an average number of 
simulations within the MCTS was determined and used for all data collection on the various 
platforms, with only ten simulations for each trial achieved. Future work could consider 
a higher number of simulations to determine a better biasing for the MCTS which may 
include a function for c based on time or the requirement for exploration or exploitation. 

Future work is needed to evaluate the optimal trajectory based on running all trajectories 
discussed in Section 8.2.3. In this thesis, we attempted to determine the optimal trajectory 
but were unable to due to limited resources and time. 
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9.3.6 In-Water Results 

Lastly, and most importantly, future work can be used to provide in-water results on the 
REMUS vehicle, using ATAN. While the simulation of ATAN proved successful, an 
in-water result would solidify ATAN as the way to achieve autonomy in the future. 

9.3.7 Applications Outside of the Undersea Domain 

In this thesis we model ATAN through the use of an AUV. However, ATAN is not limited 
to the undersea domain and has applicable approaches to any environment. Another aspect 
of future work revolves around incorporating ATAN into different types of UVs. The 
exploration-exploitation interaction remains the same, but the parameters of the vehicle and 
terrain will be matched to different environmental and technological constraints. 

9.4 Concluding Remarks 

ATAN takes advantage of the exploration-exploitation dilemma in real time through the use 
of a POMCP. In this thesis we provide notable contributions to the field of robotics, and 
provide a path forward to further improve ATAN. Through ATAN and these improvements, 
greater levels of autonomy are exhibited, improving upon the TAN framework but also 
providing a larger offering to the field of robotics. 
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APPENDIX A: 
Variogram 


This appendix shows the mathematical relationship between the variogram and the covari¬ 
ance. 


A.l Variables 

The following variables are used: 

• h is the relative distance between the test point and the measured points 

• C(h) is the covariance as a function of h 

• x is the location of the test or measured points 

• 2 y(h) is the variogram function 

• Z(x ) is the observation at location x 

• m x is the mean observation, z, based on location, x 


A.2 Variogram 

With the assumption of second-order stationarity, the covariance C(h) and variogram 2y(h) 
can be written as [80] 


C(h) = E [Z(x + h) ■ Z(x)\ - ml Vx (A.l) 

2 y(h) = E [Z(x + h) - Z(x)] 2 = C( 0) - C{h) V.v. (A.2) 

The variogram asymptotically approaches the sill, which is equal to the initial covariance 
[80], as shown in Figure A.l. 
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(a) (b) 

Figure A.l. (a) Variogram-Covariance Relationship, (b) Variogram Termi¬ 
nology. Source: [82] 


Looking at Figure A.l, the maximum value of the variogram can be written as [80], 


r(oo) = v(z(*)) = c( 0). 


(A.3) 


This can then be utilized to solve for the covariance as a function of the variogram, 


C(h ) = y(inf) - y(h). 


(A.4) 
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APPENDIX B: 
Kullback Leibler Divergence 


This appendix derives the KLD for two Gaussian distributions. 
Distributions p(x) and q(x) are given as Gaussian distributions: 


p(x) ~ N(n\,(rl) 


q(x) ~ N(/u 2 ,o-;). 


The KLD of two distributions can be written as: 


KL(] p||q) = 


A Gaussian distribution y(x) ~ N{p,cr 2 ) can be represented through its probability density 
function: 


liner 2 


1 / *-fJ \2 
- e 2V a- > . 


Using equation B.4, p(x) and q(x) can be written using their mean and variance (equations 
B.l and B.2): 

1 i / *->*\ a 

„ 2Ver,/ cd 


P(x) = 


j2ncr 2 


1 i ( \2 

2 V CT'i ' 


_ e 0-2 




Inputting equations B.5 and B.6 into equation B.3, 


KL(p\\q) = J p(x)log —^ 


1 (x-nq 2 

yjlncr^ e 2 °b 


V 27 r£r 2 2 


e 2 
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Simplifying equation B.7, 


^Mpllq)= [ P(x)log 

j v °T 7 


(x-nir 

1 cr;\ 0.5 e 


(x-u 2 r 


\dx. 


2cr~ 


(B.8) 


Using the properties of the logarithm, equation B.8 can be re-written as 

\2 _ ,.\2 


^B(p||q) 


/ p(x)log 

°^\dx+ f p(x) 

7 

Wil J 


(x- H\) (x-f*2) 


2 cr 2 


2 tr 2 2 


dx. 


(B.9) 


The first term in equation B.9 includes a constant, log(o"| /cr 2 ), which can be moved outside 
the integral, which leaves f p(x)dx = 1. Using the properties of logarithms, the first term 
can then be written as 


u 


p(x)log 


1 

~2 K X = 7 lo § 

of/ 2 


\°i 


(B.10) 


The second term in equation B.9 is more complex: 


/ 


P(x) 


(x-/il) (x — P2) 


2 (T“ 


2 <T 2 


Jx = 


2 a - 2 


+ 


2 m 2 


J (x-yUl) 2 p (X)dx 

/(X- // 2 ) 2 p(x)jx 


(B .11) 


By definition, /(x - p\) 2 p(x)dx = o' 2 which leads to equation B.ll to be: 



(x — p \) 2 (x — Pi)~ 


2cr 2 


2 a - 2 


, 1 
Jx =-- + 


2 a -2 2cr 2 


/ 


(x - p 2 ) 2 p(x)dx 


(B.12) 
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Adding in a positive and negative H\ to the square term of equation B.12 



(x-/a ) 2 (x - m) 2 

2 CTj 2 2 cr^ 





X-Hl+lUl -yU 2 ) 2 p(x)<ix 


(B .13) 


Equation B.13 can be re-written as: 



(x - Hi) 2 (x- H 2 ? 


2 CTj 2 


2^2 


, 1 1 
dx = -- + 


2 2 a" 


/ 


(x - //i)“p(x)<ix 


+ Oi 1 - ^ 2 ) 2 / p(x)Jx 

J (x-mM 


+ 2(m - hi) 


(B.14) 


Equation B.14 can be simplified using the following definitions, 

/(x - a/i) 2 p(x)Jx = of 

f p(x)dx = 1 

f(x - //i)p(x)dx = 0, 

which results in 



(x-/^l ) 2 (x - jUo ) 2 

2 cr“ 2 o-^ 




CT 2 + (yUl - yU 2 ) 2 


(B .15) 


Inputting equations B.10 and B.15 into equation B.9 gives 


1 . AA 


^B(p||q) = -log( 


1 1 


2 2(T 2 


CT 2 + (jll ~ Hi) 1 


Equation B.16 can be simplified to the equation used in Section 4.3: 


£L(p||q) = log 


/ 02 \ + oj + Ol - HI ) 2 _ 1 


\^1 


2 a - 2 


(B.16) 


(B.17) 
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APPENDIX C: 
Boltzmann Entropy 


This appendix shows all Boltzmann entropy values shown in Table C.l. 

Table C.l. Boltzmann Entropy and Relative Localization Probability. 


Boltzmann Entropy 

€( m n) 

Localization Probability 

p(X* |£(m„)) 

1 

13.7 °7c 

50 

32.9 % 

180 

90.2 % 

300 

99.8 % 

2160 

100 % 
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Latitude, 


C.l Boltzmann Entropy Maximum of 1 


23.1 


23.05 


23 


22.95 


22.9 



(a) Depth Profile. 



Longitude, [deg] 

(b) Entropy Profile. 

Figure C.l. Boltzmann Entropy Depth Profile for max(£(m)) = 1. 
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Latitude, [de 


C.2 Boltzmann Entropy Maximum of 50 




19.4 19.2 19 18.8 18.6 18.4 18.2 


a> 


1 

-121.8694 -121.8695 -121.8696 

Longitude, [deg] 

(b) Entropy Profile. 

Figure C.2. Boltzmann Entropy Depth Profile for max(£(m))= 50. 


Longitude, 


36.6139 

36.61395 

36.614 

-121.8696 36.61405 Latitude, [deg] 


(a) Depth Profile. 


18 
£, 18.5 

tL 19 

<u 

Q 19.5 
-121.8694 


-121.8695 


[deg] 
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C.3 Boltzmann Entropy Maximum of 180 


21.2 21 20.8 20.6 20.4 20.2 20 19.8 

—i-1-1-1 -1-1— 



21.5 
-121.8703 

-121.87035 

-121.8704 

-121.87045 


36.6145 


36.61455 


36.6146 


Longitude, [deg] 


-121.8705 36.61465 


Latitude, [deg] 


19.5 


20 - 


_c 20.5 - 

4-1 
CL 
0) 

Q 21- 


(a) Depth Profile. 



-121.8703 -121.87035 -121.8704 -121.87045 -121.8705 

Longitude, [deg] 

(b) Entropy Profile. 

Figure C.3. Boltzmann Entropy Depth Profile for max(£(m)) = 180. 
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Longitude, [deg] 


C.4 Boltzmann Entropy Maximum of 300 


19.55 19.5 19.45 19.4 19.35 19.3 19.25 



36.61436 36.61434 36.61432 36.6143 36.61428 36.61426 36.61424 36.61422 

Latitude, [deg] 

(a) Depth Profile. 



(b) Entropy Profile. 

Figure C.4. Boltzmann Entropy Depth Profile for max(£(m))= 300. 
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C.5 Boltzmann Entropy Maximum of 2160 



(a) Depth Profile. 


36.6142 



(b) Entropy Profile. 

Figure C.5. Boltzmann Entropy Depth Profile for max(£(m))= 2160. 
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APPENDIX D: 

Derivation Of Optimal Estimator 


This appendix shows the derivation of the optimal estimator r used in the reward function 
shown in Chapter 6 . This dual optimization is between a spatial estimator, Exploration , and 
a temporal model. Exploitation , to best choose the trajectory of the AUV. 

D.l Variables 

The following variables are used: 

• r is the reward, the true change of uncertainty 

• k\ is the Exploration gain 

• v\ is the uncertainty around of the Exploration score and is modeled as zero mean 
white noise 

• z 1 is Exploration score with uncertainty vi, z\ = r + v\ 

• lc2 is the Exploitation gain 

• V2 is the uncertainty of the Exploitation score and is modeled as zero mean white 
noise 

• Z2 is the Exploitation score with error vj, Zi = r + vo 

• f is the estimated reward 

• f is estimated reward, f = f - r 

D.2 Solution 

Forming a possible solution, 

r = k\z\ + k2Z2, 

the mean value of the estimated reward r can be calculated by 

E[r] = E[f - r] = 0 . 


(D.l) 


(D. 2 ) 
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Inputting equation D.l into D.2, 


0 = E[k\Z\ + k 2 Z 2 \ ~ E[r] 


(D.3) 


Since r is the true reward value, E[r] = r, so equation D.3 becomes 

0 = E[k\z\ + k 2 Z 2 ] ~ r 

0 = E[k\{r + vi) + k 2 (r + V 2 )] - r . (D.4) 

0 = E[ki(r + vi)] + E [k 2 (r + v 2 )] - r 


Evaluating each term in D.4 individually, 

E[k\(r + vi)] = rE[/:i] + E[£ivi] 
E[/:i(r + vi)] = rk\ 


(D.5) 


E[k 2 (r + v 2 )] = r E[k 2 ] + E[^ 2 V 2 ] 
E [k 2 (r + v 2 )] = rk 2 


(D.6) 


In equation D.5 and D.6, E[^ivi] = E[fc 2 v 2 ] = 0 since v\ and v 2 are zero mean Gaussian 
noise. Inputting the results from equations D.5 and D.6 back into equation D.4, 


0 = k\r + k 2 r - r. 


(D.7) 


Solving for k 2 as a function of k\. 


r - k\r 

k 2 =- 

r 

k 2 - 1 - k\. 


(D.8) 


The estimated reward, r, can be written using equation D.l. Inputting D.8 into equation 
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D.l, 


r = k\Z\ + k 2 z 2 
r = kizi +(1 - h)z 2 - 

Computing the estimated reward error, r, 

r = k\(r + vi) + (1 - k\)(r + v 2 ) 
r = k\V\ + v 2 - k\v 2 . 


(D.9) 


(D.10) 


Computing the variance of the system, E[r 2 ], 

E[f 2 ] = E [k\v\ + 2v 2 (1 - k\)k\v\ + v\{\ - iki) 2 ]. (D.ll) 


Equation D.ll can be simplified since E[vi] = E[v 2 ] = E[viv 2 ] = 0 since v\ and v 2 are 
uncorrelated. Additionally, the variance of the systems, E[v 2 ] = cr 2 and E[v 2 ] = cry, 
leading to 


E[f 2 ] = E^ 2 cr 2 + (1 - k\) 2 cr y. 


(D.l 2) 


Finding the minimum variance of the system with respect to k \, 


d E[f 2 ] 
dk\ 


= 0 , 


(D.13) 


and applying equation D.l2: 


0 = 2k\ cr 2 - 2cry + 2&icr 2 . 


Simplifying equation D.14 and solving for k\\ 


k] 


cr r 



(D.14) 


(D.l 5) 
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Using equations D.8 andD.15 k 2 can be solved for, 


k 2 


cr r 


°-l + (T 2 


(D.16) 
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APPENDIX E: 
Full Trajectory 


E.l Result Trajectory 




36.616 

36.6155 

CU 

-g 36.615 
0) 

3 36.6145 

36.614 

36.6135 


Depth [m] 
16 

17 

18 
19 


-121.87 -121.869 -121.868 -121.867 

Longitude, [degrees] 


(a) 500m (Cov 21.0, posit, error 0.23m). (b) 1500m (Cov 55.3, posit, error 0.36m). 




(c) 1750m (Cov 63.6, posit, error 0.17m). (d) 2000m (Cov 71.2, posit, error 0.63m). 


Figure E.l. POMCP Result Trajectories for 500m to 2000m. 
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Latitude [degrees] Latitude [degrees] 




(a) 2250m (Cov 75.5, posit, error 0.11m). (b) 2500m (Cov 77.6, posit, error 0.18m). 




(c) 2750m (Cov 81.4, posit, error 0.71m). (d) 3000m (Cov 86.7, posit, error 0.94m). 




(e) 3250 (Cov 87.5, posit, error 0.33m). (f) 3500m (Cov 90.5, posit, error 1.1m). 

Figure E.2. POMCP Result Trajectories for 2250m to 3500m. 
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Latitude [degrees] Latitude [degrees] 




(a) 3750m (Cov 91.7, posit, error 0.1m). (b) 4000m (Cov 93.7, posit, error 0.22m). 



-121.87 -121.869 -121.868 -121.867 

Longitude, [degrees] 



(c) 4250m (Cov 94.1, posit, error 0.25m). (d) 4500m (Cov 95.7, posit, error 0.72m). 



-121.87 -121.869 -121.868 -121.867 

Longitude, [degrees] 



-121.87 -121.869 -121.868 -121.867 

Longitude, [degrees] 


(e) 4750 (Cov 95.8, posit, error 0.13m). (f) 5000m (Cov 96.1, posit, error 0.24m). 


Figure E.3. POMCP Result Trajectories for 3750m to 5000m. 
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Latitude [degrees] Latitude [degrees] 



36.616 

36.6155 

CU 

v 36.615 

O) 

FI 36.6145 


-121.87 -121.869 -121.868 -121.867 



-121.87 -121.869 -121.868 -121.867 


Longitude, [degrees] Longitude, [degrees] 

(a) 5250m (Cov 96.6, posit, error 0.17m). (b) 5500m (Cov 97.8, posit, error 1.1m). 



36.616 

36.6155 

QJ 

" 36.615 

OJ 

36.6145 



-121.87 -121.869 -121.868 -121.867 

Longitude, [degrees] 


-121.87 -121.869 -121.868 -121.867 

Longitude, [degrees] 


(c) 5750m (Cov 98.6, posit, error 0.28m). (d) 6000m (Cov 98.7, posit, error 0.13m). 



-121.87 -121.869 -121.868 -121.867 

Longitude, [degrees] 


36.616 

36.6155 

O) 

o> 

% 36.615 
aj 

3 36.6145 
ro 

36.614 

36.6135 



-121.87 -121.869 -121.868 -121.867 

Longitude, [degrees] 


(e) 6250m (Cov 98.8, posit, error 0.16m). (f) 6500m (Cov 98.8, posit, error 0.09m). 

Figure E.4. POMCP Result Trajectories for 5250m to 6500m. 
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Latitude [degrees] 



-121.87 -121.869 -121.868 -121.867 

Longitude, [degrees] 



-121.87 -121.869 -121.868 -121.867 

Longitude, [degrees] 


(a) 6750m (Cov 98.8, posit, error 0.11m). (b) 7000m (Cov 99.4, posit, error 0.29m). 
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^ 36.6155 

QJ 

01 

i_ 
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5 36.6145 

ro 

_i 

36.614 
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-121.87 -121.869 -121.868 -121.867 

Longitude, [degrees] 

(c) 7250m (Cov 100, posit, error 0.34m). 

Figure E.5. POMCP Result Trajectories for 6750m to 7250m. 
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(a) 
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-0.06 


Longitude, [degrees] 


(b) (c) 

Figure E.6. (a) AUV Generated Map. (b) AUV GPM at Completion of 
Trajectory, (c) AUV Map Error Comparison to CSUMB Map [78]. 
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